Abstract



Given a knowledge base KB containing first-order and statistical facts, we consider a 
principled method, called the random-worlds method, for computing a degree of belief that 
some formula $\varphi$ holds given KB. If we are reasoning about a world or system consisting of 
N individuals, then we can consider all possible worlds, or first-order models, with domain 
{1, . . . , N} that satisfy KB, and compute the fraction of them in which $\varphi$ is true. We define 
the degree of belief to be the asymptotic value of this fraction as N grows large. We show 
that when the vocabulary underlying $\varphi$ and KB uses constants and unary predicates only, 
we can naturally associate an entropy with each world. As N grows larger, there are many 
more worlds with higher entropy. Therefore, we can use a maximum-entropy computation 
to compute the degree of belief. This result is in a similar spirit to previous work in physics 
and artificial intelligence, but is far more general. Of equal interest to the result itself are 
the limitations on its scope. Most importantly, the restriction to unary predicates seems 
necessary. Although the random-worlds method makes sense in general, the connection to 
maximum entropy seems to disappear in the non-unary case. These observations suggest 
unexpected limitations to the applicability of maximum-entropy methods. 

1. Introduction 

Consider an agent (or expert system) with some information about a particular subject, such 
as internal medicine. Some facts, such as "all patients with hepatitis exhibit jaundice", can 
be naturally expressed in a standard first-order logic, while others, such as "80% of patients 
that exhibit jaundice have hepatitis", are statistical. Suppose the agent wants to use this 
information to make decisions. For example, a doctor might need to decide whether to 
administer antibiotics to a particular patient Eric. To apply standard tools of decision 
theory (see (Luce & Raiffa, 1957) for an introduction), the agent must assign probabilities, 
or degrees of belief, to various events. For example, the doctor may need to assign a degree 
of belief to an event such as "Eric has hepatitis". We would therefore like techniques for 
computing degrees of belief in a principled manner, using all the data at hand. In this paper 
we investigate the properties of one particular formalism for doing this. 

The method we consider, which we call the random-worlds method, has origins that go 
back to Bernoulli and Laplace (1820). It is essentially an application of what has been 
called the principle of indifference (Keynes, 1921). The basic idea is quite straightforward. 
Suppose we are interested in attaching a degree of belief to a formula $\varphi$ given a knowledge 
base KB. One useful way of assigning semantics to degrees of belief formulas is to use 
a probability distribution over a set of possible worlds (Halpern, 1990). More concretely, 
suppose for now that we are reasoning about N individuals, 1, . . . , N. A world is a complete 
description of which individuals have each of the properties of interest. Formally, a world 
is just a model, or interpretation, over our first-order language. For example, if our language 
consists of the unary predicates Hepatitis, Jaundice, Child, and BlueEyed, the binary 
predicate Infected-By, and the constant Eric, then a world describes which subset of the 
N individuals satisfies each of the unary predicates, which set of pairs is in the Infected-By 
relation, and which of the N individuals is Eric. Given a prior probability distribution 
over the set of possible worlds, the agent can obtain a degree of belief in $\varphi$ given KB by 
conditioning on KB to obtain a posterior distribution, and then computing the probability 
of $\varphi$ according to this new distribution. The random-worlds method uses the principle of 
indifference to choose a particular prior distribution over the set of worlds: all the worlds 
are taken to be equally likely. It is easy to see that the degree of belief in $\varphi$ given KB is 
then precisely the fraction of worlds satisfying KB that also satisfy $\varphi$. 
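
The counting described above can be sketched by brute force for a very small domain. The following is only an illustrative sketch under stated assumptions: the function names `worlds` and `degree_of_belief`, the dictionary encoding of a world, and the sample knowledge base are all hypothetical, the domain is numbered from 0 rather than 1, and only unary predicates and constants are handled.

```python
from fractions import Fraction
from itertools import product


def worlds(n, predicates, constants):
    """Yield every world over the domain {0, ..., n-1} (hypothetical encoding).

    A unary predicate is interpreted as a tuple of n booleans, and a
    constant as a single domain element.
    """
    extensions = list(product([False, True], repeat=n))
    for exts in product(extensions, repeat=len(predicates)):
        for denotations in product(range(n), repeat=len(constants)):
            yield dict(zip(predicates, exts)), dict(zip(constants, denotations))


def degree_of_belief(n, predicates, constants, kb, phi):
    """Fraction of the worlds satisfying kb that also satisfy phi."""
    kb_worlds = both_worlds = 0
    for preds, consts in worlds(n, predicates, constants):
        if kb(preds, consts):
            kb_worlds += 1
            if phi(preds, consts):
                both_worlds += 1
    if kb_worlds == 0:
        raise ValueError("KB has no worlds of this size")
    return Fraction(both_worlds, kb_worlds)


# Sample KB (an assumption for illustration): every individual with
# hepatitis has jaundice, and Eric has jaundice.
def kb(preds, consts):
    subset = all(j or not h for h, j in zip(preds["Hepatitis"], preds["Jaundice"]))
    return subset and preds["Jaundice"][consts["Eric"]]


def phi(preds, consts):
    return preds["Hepatitis"][consts["Eric"]]


print(degree_of_belief(3, ["Hepatitis", "Jaundice"], ["Eric"], kb, phi))  # 1/2
```

With no statistical information in this sample KB, Eric's individual is either a jaundiced patient with hepatitis or one without, and the two cases are matched by equally many worlds, so the fraction is 1/2 at every domain size. The enumeration grows exponentially in N and is useful only as a check on tiny domains.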

The approach so far described applies whenever we actually know the precise domain 
size N; unfortunately this is fairly uncommon. In many cases, however, it is reasonable to 
believe that N is "large". We are thus particularly interested in the asymptotic behavior of 
this fraction; that is, we take our degree of belief to be the asymptotic value of this fraction 
as N grows large. 

For example, suppose we want to reason about a domain of hospital patients, and KB 
is the conjunction of the following four formulas: 

• $\forall x\,(Hepatitis(x) \Rightarrow Jaundice(x))$ ("all patients with hepatitis exhibit jaundice"), 

• $\|Hepatitis(x) \mid Jaundice(x)\|_x \approx 0.8$ ("approximately 80% of patients that exhibit 
jaundice have hepatitis"; we explain this formalism and the reason we say "approximately 
80%" rather than "exactly 80%" in Section 2), 

• $\|BlueEyed(x)\|_x \approx 0.25$ ("approximately 25% of patients have blue eyes"), 

• $Jaundice(Eric) \land Child(Eric)$ ("Eric is a child who exhibits jaundice"). 

Let $\varphi$ be Hepatitis(Eric); that is, we want to ascribe a degree of belief to the statement "Eric 
has hepatitis". Suppose the domain has size N. Then we want to consider all worlds with 
domain {1, . . . , N} such that the set of individuals satisfying Hepatitis is a subset of those 
satisfying Jaundice, approximately 80% of the individuals satisfying Jaundice also satisfy 
Hepatitis, approximately 25% of the individuals satisfy BlueEyed, and (the interpretation 
of) Eric is an individual satisfying Jaundice and Child. It is straightforward to show that, as 
expected, Hepatitis(Eric) holds in approximately 80% of these structures. Moreover, as N 
gets large, the fraction of structures in which Hepatitis(Eric) holds converges to exactly 0.8. 

Since 80% of the patients that exhibit jaundice have hepatitis and Eric exhibits jaundice, 
a degree of belief of 0.8 that Eric has hepatitis seems justifiable. Note that, in this example, 
the information that Eric is a child is essentially treated as irrelevant. We would get the 
same answer if we did not have the information Child(Eric). It can also be shown that 
the degree of belief in BlueEyed(Eric) converges to 0.25 as N gets large. Furthermore, 
the degree of belief of $BlueEyed(Eric) \land Jaundice(Eric)$ converges to 0.2, the product of 
0.8 and 0.25. As we shall see, this is because the random-worlds method treats BlueEyed 
and Jaundice as being independent, which is reasonable because there is no evidence to 
the contrary. (It would surely be strange to postulate that two properties were correlated 
unless there were reason to believe they were connected in some way.) 

Thus, at least in this example, the random-worlds method gives answers that follow 
from the heuristic assumptions made in many standard AI systems (Pearl, 1989; Pollock, 
1984; Spiegelhalter, 1986). Are such intuitive results typical? When do we get convergence? 
And when we do, is there a practical way to compute degrees of belief? 

The answer to the first question is yes, as we discuss in detail in (Bacchus, Grove, 
Halpern, & Koller, 1994). In that paper, we show that the random-worlds method is 
remarkably successful at satisfying the desiderata of both nonmonotonic (default) reasoning 
(Ginsberg, 1987) and reference class reasoning (Kyburg, 1983). The results of (Bacchus 
et al., 1994) show that the behavior we saw in the example above holds quite generally, 
as do many other properties we would hope to have satisfied. Thus, in this paper we do 
not spend time justifying the random-worlds approach, nor do we discuss its strengths and 
weaknesses; the reader is referred to (Bacchus et al., 1994) for such discussion and for 
an examination of previous work in the spirit of random worlds (most notably (Carnap, 
1950, 1952) and subsequent work). Rather, we focus on the latter two questions asked 
above. These questions may seem quite familiar to readers aware of the work on asymp- 
totic probabilities for various logics. For example, in the context of first-order formulas, 
it is well-known that a formula with no constant or function symbols has an asymptotic 
probability of either 0 or 1 (Fagin, 1976; Glebskii, Kogan, Liogon'kii, & Talanov, 1969). 
Furthermore, we can decide which (Grandjean, 1983). However, the 0-1 law fails if the 
language includes constants or if we look at conditional probabilities (Fagin, 1976), and we 
need both these features in order to reason about degrees of belief for formulas involving 
particular individuals, conditioned on what is known. 

In two companion papers (Grove, Halpern, & Koller, 1993a, 1993b), we consider the 
question of what happens in the pure first-order case (where there is no statistical informa- 
tion) in greater detail. We show that as long as there is at least one binary predicate symbol 
in the language, then not only do we not get asymptotic conditional probabilities in general 
(as was already shown by Fagin (1976)), but almost all the questions one might want to ask 
(such as deciding whether the limiting probability exists) are highly undecidable. However, 
if we restrict to a vocabulary with only unary predicate symbols and constants, then as 
long as the formula on which we are conditioning is satisfiable in arbitrarily large models 
(a question which is decidable in the unary case), the asymptotic conditional probability 
exists and can be computed effectively. 

In this paper, we consider the much more useful case where the knowledge base has 
statistical as well as first-order information. In light of the results of (Grove et al., 1993a, 
1993b), for most of the paper we restrict attention to the case when the knowledge base is 
expressed in a unary language. Our major result involves showing that asymptotic condi- 
tional probabilities can often be computed using the principle of maximum entropy (Jaynes, 
1957; Shannon & Weaver, 1949). 
To understand the use of maximum entropy, suppose the vocabulary consists of the 
unary predicate symbols $P_1, \ldots, P_k$. We can consider the $2^k$ atoms that can be formed from 
these predicate symbols, namely, the formulas of the form $P'_1 \land \ldots \land P'_k$, where each $P'_i$ is 
either $P_i$ or $\neg P_i$. We can view the knowledge base as placing constraints on the proportion 
of domain elements satisfying each atom. For example, the constraint $\|P_1(x) \mid P_2(x)\|_x = 1/2$ 
says that the proportion of the domain satisfying some atom that contains $P_2$ as a conjunct 
is twice the proportion satisfying atoms that contain both $P_1$ and $P_2$ as conjuncts. Given a 
model of KB, we can define the entropy of this model as the entropy of the vector denoting 
the proportions of the different atoms. We show that, as N grows large, there are many 
more models with high entropy than with lower entropy. Therefore, models with high 
entropy dominate. We use this concentration phenomenon to show that our degree of belief 
in $\varphi$ given KB according to the random-worlds method is closely related to the assignment 
of proportions to atoms that has maximum entropy among all assignments consistent with 
the constraints imposed by KB. 
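
The concentration phenomenon can be illustrated numerically. The sketch below assumes a vocabulary with unary predicates only and no constants, so that a world is fixed by which atom each of the N individuals satisfies; the number of worlds with a given vector of atom counts is then a multinomial coefficient. The function names `entropy` and `log_num_worlds` are hypothetical, and natural logarithms are used.

```python
import math


def entropy(proportions):
    """Entropy (in nats) of a vector of atom proportions."""
    return -sum(p * math.log(p) for p in proportions if p > 0)


def log_num_worlds(counts):
    """Log of the multinomial coefficient N! / (n_1! * ... * n_K!).

    This is the log of the number of worlds in which exactly n_j
    individuals satisfy the j-th atom (unary predicates, no constants).
    """
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)


N = 1000
uniform = [250, 250, 250, 250]   # maximum-entropy proportions over 4 atoms
skewed = [700, 100, 100, 100]    # lower-entropy proportions

for counts in (uniform, skewed):
    props = [c / N for c in counts]
    # log(#worlds) / N is close to the entropy of the proportions.
    print(entropy(props), log_num_worlds(counts) / N)

# The gap in log-counts grows linearly in N, so high-entropy worlds dominate.
print(log_num_worlds(uniform) - log_num_worlds(skewed))
```

Because the log of the number of worlds is roughly N times the entropy, a fixed entropy gap between two proportion vectors becomes an exponentially large gap in the number of worlds as N grows.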

The concentration phenomenon relating entropy to the random-worlds method is well- 
known (Jaynes, 1982, 1983). In physics, the "worlds" are the possible configurations of 
a system typically consisting of many particles or molecules, and the mutually exclusive 
properties (our atoms) can be, for example, quantum states. The corresponding entropy 
measure is at the heart of statistical mechanics and thermodynamics. There are subtle but 
important differences between our viewpoint and that of the physicists. The main one lies in 
our choice of language. We want to express some intelligent agent's knowledge, which is why 
we take first-order logic as our starting point. The most specific difference concerns constant 
symbols. We need these because the most interesting questions for us arise when we have 
some knowledge about — and wish to assign degrees of belief to statements concerning — a 
particular individual. The parallel in physics would address properties of a single particle, 
which is generally considered to be well outside the scope of statistical mechanics. 

Another work that examines the connection between random worlds and entropy from 
our point of view — computing degrees of belief for formulas in a particular logic — is that of 
Paris and Vencovska (1989). They restrict the knowledge base to consist of a conjunction of 
constraints that (in our notation) have the form $\|\alpha(x) \mid \beta(x)\|_x \approx r$ and $\|\alpha(x)\|_x \approx r$, where $\beta$ 
and $\alpha$ are quantifier-free formulas involving unary predicates only, with no constant symbols. 
Not only is most of the expressive power of first-order logic not available in their approach, 
but the statistical information that can be expressed is quite limited. For example, it is not 
possible to make general assertions about statistical independence. Paris and Vencovska 
show that the degree of belief can be computed using maximum entropy for their language. 
Shastri (1989) has also shown such a result, of nearly equivalent scope. But, as we have 
already suggested, we believe that it is important to look at a far richer language. Our 
language allows arbitrary first-order assertions, full Boolean logic, arbitrary polynomial 
combinations of statistical expressions, and more; these are all features that are actually 
useful to knowledge-representation practitioners. Furthermore, the random-worlds method 
makes perfect sense in this rich setting. The goal of this paper is to discover whether the 
connection to maximum entropy also holds. We show that maximum entropy continues 
to be widely useful, covering many problems that are far outside the scope of (Paris & 
Vencovska, 1989; Shastri, 1989). 
On the other hand, it turns out that we cannot make this connection for our entire 
language. For one thing, as we hinted earlier, there are problems if we try to condition on a 
knowledge base that includes non-unary predicates; we suspect that maximum entropy has 
no role whatsoever in this case. In addition, we show that there are subtleties that arise 
involving the interaction between statistical information and first-order quantification. We 
feel that an important contribution of this paper lies in pointing out some limitations of 
maximum-entropy methods. 

The rest of this paper is organized as follows. In the next section, we discuss our formal 
framework (essentially, that of (Bacchus, 1990; Halpern, 1990)). We discuss the syntax 
and semantics of statistical assertions, issues involving "approximately equals", and define 
the random-worlds method formally. In Section 3 we state the basic results that connect 
maximum entropy to random worlds, and in Section 4 we discuss how to use these results 
as effective computational procedures. In Section 5 we return to the issue of unary versus 
non-unary predicates, and the question of how widely applicable the principle of maximum 
entropy is. We conclude in Section 6 with some discussion. 

2. Technical preliminaries 

In this section, we give the formal definition of our language and the random-worlds method. 
The material is largely taken from (Bacchus et al., 1994). 

2.1 The language 

We are interested in a formal logical language that allows us to express both statistical 
information and first-order information. We therefore define a statistical language $\mathcal{L}^{\approx}$, 
which is a variant of a language designed by Bacchus (1990). For the remainder of the 
paper, let $\Phi$ be a finite first-order vocabulary, consisting of predicate and constant symbols, 
and let $\mathcal{X}$ be a set of variables.¹ 

Our statistical language augments standard first-order logic with a form of statistical 
quantifier. For a formula $\psi(x)$, the term $\|\psi(x)\|_x$ is a proportion expression. It will be 
interpreted as a rational number between 0 and 1 that represents the proportion of domain 
elements satisfying $\psi(x)$. We actually allow an arbitrary set of variables in the subscript 
and in the formula $\psi$. Thus, for example, $\|Child(x,y)\|_x$ describes, for a fixed $y$, the 
proportion of domain elements that are children of $y$; $\|Child(x,y)\|_y$ describes, for a fixed 
$x$, the proportion of domain elements whose child is $x$; and $\|Child(x,y)\|_{x,y}$ describes the 
proportion of pairs of domain elements that are in the child relation.² 

We also allow proportion expressions of the form $\|\psi(x) \mid \theta(x)\|_x$, which we call conditional 
proportion expressions. Such an expression is intended to denote the proportion of domain 
elements satisfying $\psi$ from among those elements satisfying $\theta$. Finally, any rational number 
is also considered to be a proportion expression, and the set of proportion expressions is 
closed under addition and multiplication. 

1. For simplicity, we assume that $\Phi$ does not contain function symbols, since these can be defined in terms 
of predicates. 

2. Strictly speaking, these proportion expressions should be written with sets of variables in the subscript, 
as in $\|Child(x,y)\|_{\{x,y\}}$. However, when the interpretation is clear, we often abuse notation and drop 
the set delimiters. 
One important difference between our syntax and that of (Bacchus, 1990) is the use of 
approximate equality to compare proportion expressions. There are both philosophical and 
practical reasons why exact comparisons can be inappropriate. Consider a statement such 
as "80% of patients with jaundice have hepatitis". If this statement appears in a knowledge 
base, it is almost certainly there as a summary of a large pool of data. So it would be wrong 
to interpret the value too literally, to mean that exactly 80% of all patients with jaundice 
have hepatitis. Furthermore, this interpretation would imply (among other things) that the 
number of jaundiced patients is a multiple of five! This is unlikely to be something we intend. 
We therefore use the approach described in (Bacchus et al., 1994; Koller & Halpern, 1992), 
and compare proportion expressions using (instead of $=$ and $\leq$) one of an infinite family of 
connectives $\approx_i$ and $\preceq_i$, for $i = 1, 2, 3, \ldots$ ("i-approximately equal" or "i-approximately less 
than or equal"). For example, we can express the statement "80% of jaundiced patients 
have hepatitis" by the proportion formula $\|Hep(x) \mid Jaun(x)\|_x \approx_1 0.8$. The intuition behind 
the semantics of approximate equality is that each comparison should be interpreted using 
some small tolerance factor to account for measurement error, sample variations, and so 
on. The appropriate tolerance will differ for various pieces of information, so our logic 
allows different subscripts on the "approximately equals" connectives. A formula such as 
$\|Fly(x) \mid Bird(x)\|_x \approx_1 1 \land \|Fly(x) \mid Bat(x)\|_x \approx_2 1$ says that both $\|Fly(x) \mid Bird(x)\|_x$ and 
$\|Fly(x) \mid Bat(x)\|_x$ are approximately 1, but the notion of "approximately" may be different 
in each case. The actual choice of subscript for $\approx$ is unimportant. However, it is important 
to use different subscripts for different approximate comparisons unless the tolerances for 
the different measurements are known to be the same. 

We can now give a recursive definition of the language $\mathcal{L}^{\approx}$. 

Definition 2.1: The set of terms in $\mathcal{L}^{\approx}$ is $\mathcal{X} \cup C$, where $C$ is the set of constant symbols in 
$\Phi$. The set of proportion expressions is the least set that 

(a) contains the rational numbers, 

(b) contains proportion terms of the form $\|\psi\|_X$ and $\|\psi \mid \theta\|_X$, for formulas $\psi, \theta \in \mathcal{L}^{\approx}$ and 
a finite set of variables $X \subseteq \mathcal{X}$, and 

(c) is closed under addition and multiplication. 

The set of formulas in $\mathcal{L}^{\approx}$ is the least set that 

(a) contains atomic formulas of the form $R(t_1, \ldots, t_r)$, where $R$ is a predicate symbol in 
$\Phi \cup \{=\}$ of arity $r$ and $t_1, \ldots, t_r$ are terms, 

(b) contains proportion formulas of the form $\zeta \approx_i \zeta'$ and $\zeta \preceq_i \zeta'$, where $\zeta$ and $\zeta'$ are 
proportion expressions and $i$ is a natural number, and 

(c) is closed under conjunction, negation, and first-order quantification. 

Note that $\mathcal{L}^{\approx}$ allows the use of equality when comparing terms, but not when comparing 
proportion expressions. 

This definition allows arbitrary nesting of quantifiers and proportion expressions. As 
observed in (Bacchus, 1990), the subscript $x$ in a proportion expression binds the variable 
$x$ in the expression; indeed, we can view $\|\cdot\|_x$ as a new type of quantification. 
We now need to define the semantics of the logic. As we shall see below, most of the 
definitions are fairly straightforward. The two features that cause problems are approximate 
comparisons and conditional proportion expressions. We interpret the approximate 
connective $\zeta \approx_i \zeta'$ to mean that $\zeta$ is very close to $\zeta'$. More precisely, it is within some very 
small tolerance factor. We formalize this using a tolerance vector $\vec{\tau} = \langle \tau_1, \tau_2, \ldots \rangle$, $\tau_i > 0$. 
Intuitively, $\zeta \approx_i \zeta'$ if the values of $\zeta$ and $\zeta'$ are within $\tau_i$ of each other. Of course, one problem 
with this is that we generally will not know the value of $\tau_i$. We postpone discussion of 
this issue until the next section. 

Another difficulty arises when interpreting conditional proportion expressions. The 
problem is that $\|\psi \mid \theta\|_X$ cannot be defined as a conditional probability when there are 
no assignments to the variables in $X$ that would satisfy $\theta$, because we cannot divide by 
zero. When standard equality is used rather than approximate equality this problem is 
easily overcome, simply by avoiding conditional probabilities in the semantics altogether. 
Following (Halpern, 1990), we can eliminate conditional proportion expressions altogether 
by viewing a statement such as $\|\psi \mid \theta\|_X = \alpha$ as an abbreviation for $\|\psi \land \theta\|_X = \alpha \|\theta\|_X$. 
Thus, we never actually form quotients of probabilities. This approach agrees completely 
with the standard interpretation of conditionals so long as $\|\theta\|_X \neq 0$. If $\|\theta\|_X = 0$, it 
enforces the convention that formulas such as $\|\psi \mid \theta\|_X = \alpha$ or $\|\psi \mid \theta\|_X \leq \alpha$ are true for any 
$\alpha$. (Note that we do not really care much what happens in such cases, so long as it is 
consistent and well-defined. This convention represents one reasonable choice.) 

We used the same approach in an earlier version of this paper (Grove, Halpern, & 
Koller, 1992) in the context of a language that uses approximate equality. Unfortunately, 
as the following example shows, this has problems. Unlike the case for true equality, if we 
multiply by $\|\theta\|_X$ to clear all quotients, we do not obtain an equivalent formula even if 
$\|\theta\|_X$ is nonzero. 




Example 2.2: First consider the knowledge base $KB = (\|Fly(x) \mid Penguin(x)\|_x \approx_1 0)$. 
This says that the number of flying penguins forms a tiny proportion of all penguins. 
However, if we interpret conditional proportions as above and multiply out, we obtain the 
knowledge base $KB' = \|Fly(x) \land Penguin(x)\|_x \approx_1 0 \cdot \|Penguin(x)\|_x$, which is equivalent 
to $\|Fly(x) \land Penguin(x)\|_x \approx_1 0$. $KB'$ just says that the number of flying penguins is small, 
and has lost the (possibly important) information that the number of flying penguins is 
small relative to the number of penguins. It is quite consistent with $KB'$ that all penguins fly 
(provided the total number of penguins is small); this is not consistent with KB. Clearly, the 
process of multiplying out across an approximate connective does not preserve the intended 
interpretation of the formulas. 

This example demonstrates an undesirable interaction between the semantics we have 
chosen for approximate equality and the process of multiplying-out to eliminate conditional 
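proportions. We expect $\|\psi \mid \theta\|_X \approx_1 \alpha$ to mean that $\|\psi \mid \theta\|_X$ is within some tolerance $\tau_1$ of $\alpha$. 
Assuming $\|\theta\|_X > 0$, this is the same as saying that $\|\psi \land \theta\|_X$ is within $\tau_1 \|\theta\|_X$ of $\alpha \|\theta\|_X$. 
On the other hand, the expression that results by multiplying out is $\|\psi \land \theta\|_X \approx_1 \alpha \|\theta\|_X$. 
This says that $\|\psi \land \theta\|_X$ is within $\tau_1$ (not $\tau_1 \|\theta\|_X$) of $\alpha \|\theta\|_X$. As we saw above, the 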
difference between the two interpretations can be significant. 
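
A small numeric check makes the gap between the two readings concrete. This is a sketch under assumptions chosen for illustration: a domain of 1000 individuals, 5 penguins that all fly, and a tolerance of 1/100; the three function names are hypothetical. Exact rational arithmetic is used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction


def approx_conditional(n_both, n_theta, alpha, tau):
    """Intended reading: the conditional proportion is within tau of alpha.

    Assumes n_theta > 0.
    """
    return abs(Fraction(n_both, n_theta) - alpha) <= tau


def approx_multiplied_naive(n_both, n_theta, n, alpha, tau):
    """Multiply out, but leave the tolerance unscaled."""
    return abs(Fraction(n_both, n) - alpha * Fraction(n_theta, n)) <= tau


def approx_multiplied_scaled(n_both, n_theta, n, alpha, tau):
    """Multiply out and scale the tolerance by the proportion of theta."""
    theta = Fraction(n_theta, n)
    return abs(Fraction(n_both, n) - alpha * theta) <= tau * theta


n, penguins, flying_penguins = 1000, 5, 5
alpha, tau = Fraction(0), Fraction(1, 100)

print(approx_conditional(flying_penguins, penguins, alpha, tau))           # False
print(approx_multiplied_naive(flying_penguins, penguins, n, alpha, tau))   # True
print(approx_multiplied_scaled(flying_penguins, penguins, n, alpha, tau))  # False
```

In this world all penguins fly, so the intended reading rejects it, yet the naively multiplied-out formula accepts it because 5/1000 is within 1/100 of 0. Scaling the tolerance restores agreement with the intended reading.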

Because of this problem, we cannot treat conditional proportions as abbreviations and 
instead have added them as primitive expressions in the language. Of course, we now have 
to give them a semantics that avoids the problem illustrated by Example 2.2. We would 
like to maintain the conventions used when we had equality in the language. Namely, 
in worlds where $\|\theta(x)\|_x \neq 0$, we want $\|\psi(x) \mid \theta(x)\|_x$ to denote the fraction of elements 
satisfying $\theta(x)$ that also satisfy $\psi(x)$. In worlds where $\|\theta(x)\|_x = 0$, we want formulas 
of the form $\|\psi(x) \mid \theta(x)\|_x \approx_i \alpha$ or $\|\psi(x) \mid \theta(x)\|_x \preceq_i \alpha$ to be true. There are a number of 
ways of accomplishing this. The way we take is perhaps not the simplest, but it introduces 
machinery that will be helpful later. The basic idea is to make the interpretation of $\approx$ more 
explicit, so that we can eliminate conditional proportions by multiplication and keep track 
of all the consequences of doing so. 

We give semantics to the language $\mathcal{L}^{\approx}$ by providing a translation from formulas in $\mathcal{L}^{\approx}$ 
to formulas in a language $\mathcal{L}^{=}$ whose semantics is more easily described. The language $\mathcal{L}^{=}$ is 
essentially the language of (Halpern, 1990), which uses true equality rather than approximate 
equality when comparing proportion expressions. More precisely, the definition of $\mathcal{L}^{=}$ is 
identical to the definition of $\mathcal{L}^{\approx}$ given in Definition 2.1, except that: 

• we use $=$ and $\leq$ instead of $\approx_i$ and $\preceq_i$, 

• we allow the set of proportion expressions to include arbitrary real numbers (not just 
rational numbers), 

• we do not allow conditional proportion expressions, 

• we assume that $\mathcal{L}^{=}$ has a special family of variables $\varepsilon_i$, for $i = 1, 2, \ldots$, interpreted 
over the reals. 

The variable $\varepsilon_i$ is used to explicitly interpret the approximate equality connectives $\approx_i$ and 
$\preceq_i$. Once this is done, we can safely multiply out the conditionals, as described above. More 
precisely, every formula $\chi \in \mathcal{L}^{\approx}$ can be associated with a formula $\chi^* \in \mathcal{L}^{=}$ as follows: 

• every proportion formula $\zeta \preceq_i \zeta'$ in $\chi$ is (recursively) replaced by $\zeta - \zeta' \leq \varepsilon_i$, 

• every proportion formula $\zeta \approx_i \zeta'$ in $\chi$ is (recursively) replaced by the conjunction 
$(\zeta - \zeta' \leq \varepsilon_i) \land (\zeta' - \zeta \leq \varepsilon_i)$, 

• finally, conditional proportion expressions are eliminated by multiplying out. 

This translation allows us to embed $\mathcal{L}^{\approx}$ into $\mathcal{L}^{=}$. Thus, for the remainder of the paper, 
we regard $\mathcal{L}^{\approx}$ as a sublanguage of $\mathcal{L}^{=}$. This embedding avoids the problem encountered in 
Example 2.2, because when we multiply to clear conditional proportions the tolerances are 
explicit, and so are also multiplied as appropriate. 

The semantics for $\mathcal{L}^{=}$ is quite straightforward, and is similar to that in (Halpern, 1990). 
We give semantics to $\mathcal{L}^{=}$ in terms of worlds, or finite first-order models. For any natural 
number $N$, let $\mathcal{W}_N$ consist of all worlds with domain $\{1, \ldots, N\}$. Thus, in $\mathcal{W}_N$, we have 
one world for each possible interpretation of the symbols in $\Phi$ over the domain $\{1, \ldots, N\}$. 
Let $\mathcal{W}^*$ denote $\bigcup_N \mathcal{W}_N$. 

Now, consider some world $W \in \mathcal{W}^*$ over the domain $D = \{1, \ldots, N\}$, some valuation 
$V : \mathcal{X} \to D$ for the variables in $\mathcal{X}$, and some tolerance vector $\vec{\tau}$. We simultaneously assign 
to each proportion expression $\zeta$ a real number $[\zeta]_{(W,V,\vec{\tau})}$ and to each formula $\xi$ a truth value 
with respect to $(W, V, \vec{\tau})$. Most of the clauses of the definition are completely standard, so we 
omit them here. In particular, variables are interpreted using $V$, the tolerance variables $\varepsilon_i$ 
are interpreted using the tolerances $\tau_i$, the predicates and constants are interpreted using $W$, 
the Boolean connectives and the first-order quantifiers are defined in the standard fashion, 
and when interpreting proportion expressions, the real numbers, addition, multiplication, 
and $\leq$ are given their standard meaning. It remains to interpret proportion terms. Recall 
that we eliminate conditional proportion terms by multiplying out, so that we need to deal 
only with unconditional proportion terms. If $\zeta$ is the proportion expression $\|\psi\|_{x_{i_1}, \ldots, x_{i_k}}$ 
(for $i_1 < i_2 < \cdots < i_k$), then 

$$[\zeta]_{(W,V,\vec{\tau})} = \frac{1}{|D|^k} \left| \left\{ (d_1, \ldots, d_k) \in D^k : (W, V[x_{i_1}/d_1, \ldots, x_{i_k}/d_k], \vec{\tau}) \models \psi \right\} \right|.$$ 

Thus, if $|D| = N$, the proportion expression $\|\psi\|_{x_{i_1}, \ldots, x_{i_k}}$ denotes the fraction of the $N^k$ 
$k$-tuples in $D^k$ that satisfy $\psi$. For example, $[\|Child(x,y)\|_x]_{(W,V,\vec{\tau})}$ is the fraction of domain 
elements $d$ that are children of $V(y)$. 
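
The clause for unconditional proportion terms can be sketched directly as a count over tuples. In this illustrative sketch the function name `proportion`, the encoding of a valuation as a dictionary, and the sample Child relation are all assumptions; the world is captured implicitly by the Python predicate passed in as `psi`.

```python
from fractions import Fraction
from itertools import product


def proportion(domain, psi, bound_vars, valuation):
    """Fraction of |bound_vars|-tuples over the domain that satisfy psi.

    psi takes a valuation (dict from variable names to domain elements);
    the variables in bound_vars are re-bound to each tuple in turn, while
    the remaining variables keep their values from `valuation`.
    """
    k = len(bound_vars)
    hits = 0
    for tup in product(domain, repeat=k):
        v = dict(valuation)
        v.update(zip(bound_vars, tup))
        if psi(v):
            hits += 1
    return Fraction(hits, len(domain) ** k)


# Hypothetical world: Child(x, y) holds when x is a child of y.
domain = [1, 2, 3, 4]
child = {(2, 1), (3, 1), (4, 2)}


def child_formula(v):
    return (v["x"], v["y"]) in child


# ||Child(x, y)||_x with V(y) = 1: elements 2 and 3 are children of 1.
print(proportion(domain, child_formula, ["x"], {"y": 1}))   # 1/2
# ||Child(x, y)||_{x,y}: 3 of the 16 pairs are in the relation.
print(proportion(domain, child_formula, ["x", "y"], {}))    # 3/16
```

Variables that are not bound by the subscript are read from the valuation, which mirrors the role of $V$ in the definition above.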

Using our embedding of $\mathcal{L}^{\approx}$ into $\mathcal{L}^{=}$, we now have semantics for $\mathcal{L}^{\approx}$. For $\chi \in \mathcal{L}^{\approx}$, we 
say that $(W, V, \vec{\tau}) \models \chi$ iff $(W, V, \vec{\tau}) \models \chi^*$. It is sometimes useful in our future results to 
incorporate particular values for the tolerances into the formula $\chi^*$. Thus, let $\chi[\vec{\tau}]$ represent 
the formula that results from $\chi^*$ if each variable $\varepsilon_i$ is replaced with its value according to 
$\vec{\tau}$, that is, $\tau_i$.³ 

Typically we are interested in closed sentences, that is, formulas with no free variables. 
In that case, it is not hard to show that the valuation plays no role. Thus, if $\chi$ is closed, 
we write $(W, \vec{\tau}) \models \chi$ rather than $(W, V, \vec{\tau}) \models \chi$. Finally, if KB and $\chi$ are closed formulas, 
we write $KB \models \chi$ if $(W, \vec{\tau}) \models KB$ implies $(W, \vec{\tau}) \models \chi$. 






2.2 Degrees of belief 

As we explained in the introduction, we give semantics to degrees of belief by considering all 
worlds of size N to be equally likely, conditioning on KB, and then checking the probability 
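of $\varphi$ over the resulting probability distribution. In the previous section, we defined what it 
means for a sentence $\chi$ to be satisfied in a world of size $N$ using a tolerance vector $\vec{\tau}$. Given 
$N$ and $\vec{\tau}$, we define $\#\mathit{worlds}^{\vec{\tau}}_N(\chi)$ to be the number of worlds $W$ in $\mathcal{W}_N$ such that $(W, \vec{\tau}) \models \chi$. 
Since we are taking all worlds to be equally likely, the degree of belief in $\varphi$ given KB with 
respect to $\mathcal{W}_N$ and $\vec{\tau}$ is 

$$\Pr^{\vec{\tau}}_N(\varphi \mid KB) = \frac{\#\mathit{worlds}^{\vec{\tau}}_N(\varphi \land KB)}{\#\mathit{worlds}^{\vec{\tau}}_N(KB)}.$$ 

If $\#\mathit{worlds}^{\vec{\tau}}_N(KB) = 0$, this degree of belief is not well-defined. 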

The careful reader may have noticed a potential problem with this definition. Strictly 
speaking, we should write $\mathcal{W}_N(\Phi)$ rather than $\mathcal{W}_N$, since the set of worlds under consideration 
clearly depends on the vocabulary. Hence, the number of worlds in $\mathcal{W}_N$ also depends 
on the vocabulary. Thus, both $\#\mathit{worlds}^{\vec{\tau}}_N(\varphi)$ and $\#\mathit{worlds}^{\vec{\tau}}_N(\varphi \land KB)$ depend on the choice 
of $\Phi$. Fortunately, this dependence "cancels out": if $\Phi' \supset \Phi$, then there is a constant $c$ 
such that for all formulas $\chi$ over the vocabulary $\Phi$, $\#[\Phi']\mathit{worlds}^{\vec{\tau}}_N(\chi) = c \cdot \#[\Phi]\mathit{worlds}^{\vec{\tau}}_N(\chi)$. 
This result, from which it follows that the degree of belief $\Pr^{\vec{\tau}}_N(\varphi \mid KB)$ is independent of 
our choice of vocabulary, is proved in (Grove et al., 1993b). 

3. Note that some of the tolerances $\tau_i$ may be irrational; it is for this reason that we allowed irrational 
numbers in proportion expressions in $\mathcal{L}^{=}$. 

Typically, we know neither $N$ nor $\vec{\tau}$ exactly. All we know is that $N$ is "large" and 
that $\vec{\tau}$ is "small". Thus, we would like to take our degree of belief in $\varphi$ given KB to 
be $\lim_{\vec{\tau} \to \vec{0}} \lim_{N \to \infty} \Pr^{\vec{\tau}}_N(\varphi \mid KB)$. Notice that the order of the two limits over $\vec{\tau}$ and $N$ 
is important. If the limit $\lim_{\vec{\tau} \to \vec{0}}$ appeared last, then we would gain nothing by using 
approximate equality, since the result would be equivalent to treating approximate equality 
as exact equality. 

This definition, however, is not sufficient; the limit may not exist. We observed above 
that $\Pr^{\vec{\tau}}_N(\varphi \mid KB)$ is not always well-defined. In particular, it may be the case that for 
certain values of $\vec{\tau}$, $\Pr^{\vec{\tau}}_N(\varphi \mid KB)$ is not well-defined for arbitrarily large $N$. In order to 
deal with this problem of well-definedness, we define KB to be eventually consistent if 
for all sufficiently small $\vec{\tau}$ and sufficiently large $N$, $\#\mathit{worlds}^{\vec{\tau}}_N(KB) > 0$. Among other 
things, eventual consistency implies that the KB is satisfiable in finite domains of arbitrarily 
large size. For example, a KB stating that "there are exactly 7 domain elements" is not 
eventually consistent. For the remainder of the paper, we assume that all knowledge bases 
are eventually consistent. In practice, we expect eventual consistency to be no harder 
to check than consistency. We do not expect a knowledge base to place bounds on the 
domain size, except when the bound is readily apparent. For those unsatisfied with this 
intuition, it is also possible to find formal conditions ensuring eventual consistency. For 
instance, it is possible to show that the following conditions are sufficient to guarantee 
that KB is eventually consistent: (a) KB does not use any non-unary predicates, including 
equality between terms and (b) KB is consistent for some domain size when all approximate 
comparisons are replaced by exact comparisons. Since we concentrate on unary languages 
in this paper, this result covers most cases of interest. 

Even if $KB$ is eventually consistent, the limit may not exist. For example, it may be
the case that $\Pr^{\vec{\tau}}_N(\varphi|KB)$ oscillates between $\alpha + \tau_i$ and $\alpha - \tau_i$ for some $i$ as $N$ gets large. In
this case, for any particular $\vec{\tau}$, the limit as $N$ grows will not exist. However, it seems as if
the limit as $\vec{\tau}$ grows small should, in this case, be $\alpha$, since the oscillations about $\alpha$ go to 0.
We avoid such problems by considering the lim sup and lim inf, rather than the limit. For
any set $S \subset \mathbb{R}$, the infimum of $S$, $\inf S$, is the greatest lower bound of $S$. The lim inf of a
sequence is the limit of the infimums; that is,

$$\liminf_{N\to\infty} a_N = \lim_{N\to\infty} \inf\{a_i : i > N\}.$$

The lim inf exists for any sequence bounded from below, even if the limit does not. The lim
sup is defined analogously, where $\sup S$ denotes the least upper bound of $S$. If $\lim_{N\to\infty} a_N$
does exist, then $\lim_{N\to\infty} a_N = \liminf_{N\to\infty} a_N = \limsup_{N\to\infty} a_N$. Since, for any $\vec{\tau}$, the
sequence $\Pr^{\vec{\tau}}_N(\varphi|KB)$ is always bounded from above and below, the lim sup and lim inf
always exist. Thus, we do not have to worry about the problem of nonexistence for particular
values of $\vec{\tau}$. We can now present the final form of our definition.

Definition 2.3: If

$$\lim_{\vec{\tau}\to\vec{0}} \liminf_{N\to\infty} \Pr^{\vec{\tau}}_N(\varphi|KB) \quad \text{and} \quad \lim_{\vec{\tau}\to\vec{0}} \limsup_{N\to\infty} \Pr^{\vec{\tau}}_N(\varphi|KB)$$

both exist and are equal, then the degree of belief in $\varphi$ given $KB$, written $\Pr_\infty(\varphi|KB)$, is
defined as the common limit; otherwise $\Pr_\infty(\varphi|KB)$ does not exist.

We close this section with a few remarks on our definition. First note that, even using 
this definition, there are many cases where the degree of belief does not exist. However, 
as some of our later examples show, in many situations the nonexistence of a degree of 
belief can be understood intuitively (for instance, see Example 4.3 and the subsequent 
discussion). We could, alternatively, have taken the degree of belief to be the interval
defined by $\lim_{\vec{\tau}\to\vec{0}} \liminf_{N\to\infty} \Pr^{\vec{\tau}}_N(\varphi|KB)$ and $\lim_{\vec{\tau}\to\vec{0}} \limsup_{N\to\infty} \Pr^{\vec{\tau}}_N(\varphi|KB)$, provided
each of them exists. This would have been a perfectly reasonable choice; most of the results
we state would go through with very little change if we had taken this definition. Our 
definition simplifies the exposition slightly. 

Finally, we remark that it may seem unreasonable to take limits if we know the domain
size or have a bound on the domain size. Clearly, if we know $N$ and $\vec{\tau}$, then it seems more
reasonable to use $\Pr^{\vec{\tau}}_N$ rather than $\Pr_\infty$ as our degree of belief. Indeed, as shown in (Bacchus
et al., 1994), many of the important properties that hold for the degree of belief defined
by $\Pr_\infty$ hold for $\Pr^{\vec{\tau}}_N$, for all choices of $N$ and $\vec{\tau}$. The connection to maximum entropy
that we make in this paper holds only at the limit, but because (as our proofs show) the
convergence is rapid, the degree of belief $\Pr_\infty(\varphi|KB)$ is typically a very good approximation
to $\Pr^{\vec{\tau}}_N(\varphi|KB)$, even for moderately large $N$ and moderately small $\vec{\tau}$.

3. Degrees of belief and entropy 

3.1 Introduction to maximum entropy 

The idea of maximizing entropy has played an important role in many fields, including
the study of probabilistic models for inferring degrees of belief (Jaynes, 1957; Shannon &
Weaver, 1949). In the simplest setting, we can view entropy as a real-valued function on
finite probability spaces. If $\Omega$ is a finite set and $\mu$ is a probability measure on $\Omega$, the entropy
$H(\mu)$ is defined to be $-\sum_{\omega \in \Omega} \mu(\omega) \ln \mu(\omega)$ (we take $0 \ln 0 = 0$).

One standard application of entropy is the following. Suppose we know the space $\Omega$, but
have only partial information about $\mu$, expressed in the form of constraints. For example,
we might have a constraint such as $\mu(\omega_1) + \mu(\omega_2) \geq 1/3$. Although there may be many
measures $\mu$ that are consistent with what we know, the principle of maximum entropy
suggests that we adopt that $\mu^*$ which has the largest entropy among all the consistent
possibilities. Using the appropriate definitions, it can be shown that there is a sense in
which this $\mu^*$ incorporates the "least" additional information (Shannon & Weaver, 1949).
For example, if we have no constraints on $\mu$, then $\mu^*$ will be the measure that assigns equal
probability to all elements of $\Omega$. Roughly speaking, $\mu^*$ assigns probabilities as equally as
possible given the constraints.
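
To make the principle concrete, the following minimal Python sketch computes the entropy of a
measure on a finite set and searches numerically for the maximum-entropy measure under a single
constraint of the form $\mu(\omega_1) + \mu(\omega_2) \geq 1/3$. It is an illustration only: the function
names, the set sizes, and the grid search are assumptions made here, not part of the method
described in this paper. The search relies on the observation that, since entropy is symmetric
and concave, a maximizer can be taken to be uniform inside the constrained group and uniform
outside it.

```python
import math

def entropy(mu):
    """H(mu) = -sum mu(w) ln mu(w), taking 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in mu if p > 0)

def max_entropy_with_mass_constraint(n_total, n_group, min_mass, steps=10000):
    """Approximate the maximum-entropy measure on n_total points subject to
    mu(group) >= min_mass, where the group consists of n_group points.

    Only the total mass m of the group is searched (on a grid); within the
    group and within its complement the mass is spread uniformly.
    Returns (entropy, group mass, measure as a list).
    """
    n_rest = n_total - n_group
    best = None
    for k in range(steps + 1):
        m = min_mass + (1 - min_mass) * k / steps
        mu = [m / n_group] * n_group + [(1 - m) / n_rest] * n_rest
        h = entropy(mu)
        if best is None or h > best[0]:
            best = (h, m, mu)
    return best

# Four points: the constraint is slack, so the uniform measure is found.
print(max_entropy_with_mass_constraint(4, 2, 1 / 3)[:2])
# Eight points: the constraint binds, so the group gets mass 1/3.
print(max_entropy_with_mass_constraint(8, 2, 1 / 3)[:2])
```

With four points the uniform measure already satisfies the constraint; with eight points the
uniform measure would give the group mass 1/4, so the maximizer sits on the boundary of the
constraint, which is the sense in which probabilities are spread "as equally as possible".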

3.2 From formulas to constraints 

Like maximum entropy, the random-worlds method is also used to determine degrees of belief
(i.e., probabilities) relative to a knowledge base. Aside from this, is there any connection
between the two ideas? Of course, there is the rather trivial observation that random-worlds
considers a uniform probability distribution (over the set of worlds satisfying $KB$), and it is
well-known that the uniform distribution over any set has the highest possible entropy. But
in this section we show another, entirely different and much deeper, connection between
random-worlds and the principle of maximum entropy. This connection holds provided that
we restrict the knowledge base so that it uses only unary predicates and constants. In this 
case we can consider probability distributions, and in particular the maximum-entropy dis- 
tribution, over the set of atoms. Atoms are of course very different from possible worlds; 
for instance, there are only finitely many of them (independent of the domain size N). 
Furthermore, the maximum-entropy distributions we consider will typically not be uniform. 
Nevertheless, maximum entropy in this new space can tell us a lot about the degrees of 
belief defined by random worlds. In particular, this connection will allow us to use maxi- 
mum entropy as a tool for computing degrees of belief. We believe that the restriction to 
unary predicates is necessary for the connection we are about to make. Indeed, as long as 
the knowledge base makes use of a binary predicate symbol (or unary function symbol), we 
suspect that there is no useful connection between the two approaches at all; see Section 5 
for some discussion. 

Let $\mathcal{L}^{\approx}_1$ be the sublanguage of $\mathcal{L}^{\approx}$ where only unary predicate symbols and constant
symbols appear in formulas; in particular, we assume that equality between terms does not
occur in formulas in $\mathcal{L}^{\approx}_1$.⁴ (Recall that in $\mathcal{L}^{\approx}$, we allow equality between terms, but disallow
equality between proportion expressions.) Let $\mathcal{L}^{=}_1$ be the corresponding sublanguage of
$\mathcal{L}^{=}$. In this subsection, we show that the expressive power of a knowledge base $KB$ in the
language $\mathcal{L}^{=}_1$ is quite limited. In fact, such a $KB$ can essentially only place constraints on the
proportions of the atoms. If we then think of these as constraints on the "probabilities of the
atoms", then we have the ingredients necessary to apply maximum entropy. In Section 3.3
we show that there is a strong connection between the maximum-entropy distribution found
this way and the degree of belief generated by the random-worlds method.

To see what constraints a formula places on the probabilities of atoms, it is useful to
convert the formula to a certain canonical form. As a first step to doing this, we formalize
the definition of atom given in the introduction. Let $\mathcal{P} = \{P_1, \ldots, P_k\}$ consist of the unary
predicate symbols in the vocabulary $\Phi$.

Definition 3.1: An atom (over $\mathcal{P}$) is a conjunction of the form $P'_1(x) \wedge \ldots \wedge P'_k(x)$, where
each $P'_i$ is either $P_i$ or $\neg P_i$. Since the variable $x$ is irrelevant to our concerns, we typically
suppress it and describe an atom as a conjunction of the form $P'_1 \wedge \ldots \wedge P'_k$. ∎

Note that there are $2^{|\mathcal{P}|} = 2^k$ atoms over $\mathcal{P}$ and that they are mutually exclusive and
exhaustive. Throughout this paper, we use $K$ to denote $2^k$ and $A_1, \ldots, A_K$ to denote the
atoms over $\mathcal{P}$, listed in some fixed order.

Example 3.2: There are $K = 4$ atoms over $\mathcal{P} = \{P_1, P_2\}$: $A_1 = P_1 \wedge P_2$, $A_2 = P_1 \wedge \neg P_2$,
$A_3 = \neg P_1 \wedge P_2$, $A_4 = \neg P_1 \wedge \neg P_2$. ∎
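
The construction of atoms is easy to mechanize. The sketch below is illustrative only; the
representation of an atom as a tuple of booleans and the helper names `atoms`, `show`, and
`atom_index` are assumptions introduced here. It enumerates the $2^k$ atoms over a list of unary
predicates in the same order as Example 3.2 and finds the unique atom satisfied by a domain
element, which reflects the fact that atoms are mutually exclusive and exhaustive.

```python
from itertools import product

def atoms(predicates):
    """List all 2^k atoms over the given unary predicate names.

    An atom is a tuple of booleans, one per predicate: True means the
    predicate occurs positively, False means it occurs negated.
    """
    return list(product([True, False], repeat=len(predicates)))

def show(atom, predicates):
    """Render an atom as a conjunction, writing ~P for the negation of P."""
    return " & ".join(p if sign else "~" + p for p, sign in zip(predicates, atom))

def atom_index(properties, predicates):
    """0-based index of the unique atom satisfied by a domain element,
    given the set of predicates that the element satisfies."""
    signs = tuple(p in properties for p in predicates)
    return atoms(predicates).index(signs)

preds = ["P1", "P2"]
for i, a in enumerate(atoms(preds), start=1):
    print(f"A{i} = {show(a, preds)}")

# An element satisfying only P2 falls under the third atom, ~P1 & P2.
print(atom_index({"P2"}, preds) + 1)
```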

The atomic proportion terms $\|A_1(x)\|_x, \ldots, \|A_K(x)\|_x$ will play a significant role in
our technical development. It turns out that $\mathcal{L}^{=}_1$ is a rather weak language: a formula
$KB \in \mathcal{L}^{=}_1$ does little more than constrain the proportion of the atoms. In other words, for
any such $KB$ we can find an equivalent formula in which the only proportion expressions
are these unconditional proportions of atoms. The more complex syntactic machinery in
$\mathcal{L}^{=}_1$ (proportions over tuples, first-order quantification, nested proportions, and conditional
proportions) does not add expressive power. (It does add convenience, however; knowledge
can often be expressed far more succinctly if the full power of the language is used.)

4. We remark that many of our results can be extended to the case where the KB mentions equality, but
the extra complexity obscures many of the essential ideas.

Given any $KB$, the first step towards applying maximum entropy is to use $\mathcal{L}^{=}_1$'s lack of
expressivity and replace all proportion terms by atomic proportion terms. It is also useful
to make various other simplifications to $KB$ that will help us in Section 4. We combine
these steps and require that $KB$ be transformed into a special canonical form which we now
describe.

Definition 3.3: An atomic term $t$ over $\mathcal{P}$ is a polynomial over terms of the form $\|A(x)\|_x$,
where $A$ is an atom over $\mathcal{P}$. Such an atomic term $t$ is positive if every coefficient of the
polynomial $t$ is positive. ∎

Definition 3.4: A (closed) sentence $\chi \in \mathcal{L}^{=}_1$ is in canonical form if it is a disjunction of
conjunctions, where each conjunct is one of the following:

• $t' = 0$, $(t' > 0 \wedge t \leq t'\varepsilon_i)$, or $(t' > 0 \wedge \neg(t \leq t'\varepsilon_i))$, where $t$ and $t'$ are atomic terms and
$t'$ is positive,

• $\exists x\, A_i(x)$ or $\neg\exists x\, A_i(x)$ for some atom $A_i$, or

• $A_i(c)$ for some atom $A_i$ and some constant $c$.

Furthermore, a disjunct cannot contain both $A_i(c)$ and $A_j(c)$ for $i \neq j$ as conjuncts, nor can
it contain both $A_i(c)$ and $\neg\exists x\, A_i(x)$. (Note that these last conditions are simply minimal
consistency requirements.) ∎

Theorem 3.5: Every formula in $\mathcal{L}^{=}_1$ is equivalent to a formula in canonical form. Moreover,
there is an effective procedure that, given a formula $\xi \in \mathcal{L}^{=}_1$, constructs an equivalent
formula $\hat{\xi}$ in canonical form.

The proof of this theorem, and of all theorems in this paper, can be found in the appendix.

We remark that the length of the formula $\hat{\xi}$ is typically exponential in the length of $\xi$.
Such a blowup seems inherent in any scheme defined in terms of atoms.

Theorem 3.5 is a generalization of Claim 5.7.1 in (Halpern, 1990). It, in turn, is a
generalization of a well-known result which says that any first-order formula with only unary
predicates is equivalent to one with only depth-one quantifier nesting. Roughly speaking,
this is because for a quantified formula such as $\exists x\, \varphi'$, subformulas talking about a variable
$y$ other than $x$ can be moved outside the scope of the quantifier. This is possible because
no literal subformula can talk about $x$ and $y$ together. Our proof uses the same idea and
extends it to proportion statements. In particular, it shows that for any $\xi \in \mathcal{L}^{=}_1$ there is an
equivalent $\hat{\xi}$ which has no nested quantifiers or nested proportions.

Notice, however, that such a result does not hold once we allow even a single binary
predicate in the language. For example, the formula $\forall y\, \exists x\, R(x,y)$ clearly needs nested
quantification because $R(x,y)$ talks about both $x$ and $y$ and so must remain within the
scope of both quantifiers. With binary predicates, each additional depth of nesting really
does add expressive power. This shows that there can be no "canonical form" theorem quite
like Theorem 3.5 for richer languages. This issue is one of the main reasons why we restrict
the KB to a unary language in this paper. (See Section 5 for further discussion.)

Given any formula in canonical form we can immediately derive from it, in a syntactic
manner, a set of constraints on the possible proportions of atoms.

Definition 3.6: Let $KB$ be in canonical form. We construct a formula $\Gamma(KB)$ in the language
of real closed fields (i.e., over the vocabulary $\{0, 1, +, \times\}$) as follows, where $u_1, \ldots, u_K$
are fresh variables (distinct from the tolerance variables $\varepsilon_j$):

• we replace each occurrence of the formula $A_i(c)$ by $u_i > 0$,

• we replace each occurrence of $\exists x\, A_i(x)$ by $u_i > 0$ and replace each occurrence of
$\neg\exists x\, A_i(x)$ by $u_i = 0$,

• we replace each occurrence of $\|A_i(x)\|_x$ by $u_i$. ∎

Notice that $\Gamma(KB)$ has two types of variables: the new variables $u_i$ that we just introduced,
and the tolerance variables $\varepsilon_j$. In order to eliminate the dependence on the latter, we often
consider the formula $\Gamma(KB[\vec{\tau}])$ for some tolerance vector $\vec{\tau}$.

Definition 3.7: Given a formula $\gamma$ over the variables $u_1, \ldots, u_K$, let $\mathit{Sol}[\gamma]$ be the set of
vectors in $\Delta^K = \{\vec{u} \in [0,1]^K : \sum_{i=1}^{K} u_i = 1\}$ satisfying $\gamma$. Formally, if $(a_1, \ldots, a_K) \in \Delta^K$,
then $(a_1, \ldots, a_K) \in \mathit{Sol}[\gamma]$ iff $(\mathbb{R}, V) \models \gamma$, where $V$ is a valuation such that $V(u_i) = a_i$. ∎

Definition 3.8: The solution space of $KB$ given $\vec{\tau}$, denoted $S^{\vec{\tau}}[KB]$, is defined to be the
closure of $\mathit{Sol}[\Gamma(KB[\vec{\tau}])]$.⁵ ∎

If $KB$ is not in canonical form, we define $\Gamma(KB)$ and $S^{\vec{\tau}}[KB]$ to be $\Gamma(\widehat{KB})$ and $S^{\vec{\tau}}[\widehat{KB}]$,
respectively, where $\widehat{KB}$ is the formula in canonical form equivalent to $KB$ obtained by the
procedure appearing in the proof of Theorem 3.5.

Example 3.9: Let $\mathcal{P}$ be $\{P_1, P_2\}$, with the atoms ordered as in Example 3.2. Consider

$$KB = \forall x\, P_1(x) \wedge 3\|P_1(x) \wedge P_2(x)\|_x \preceq_i 1.$$

The canonical formula $\widehat{KB}$ equivalent to $KB$ is:⁶

$$\neg\exists x\, A_3(x) \wedge \neg\exists x\, A_4(x) \wedge 3\|A_1(x)\|_x - 1 \leq \varepsilon_i.$$

As expected, $\widehat{KB}$ constrains both $\|A_3(x)\|_x$ and $\|A_4(x)\|_x$ (i.e., $u_3$ and $u_4$) to be 0. We also
see that $\|A_1(x)\|_x$ (i.e., $u_1$) is (approximately) at most 1/3. Therefore:

$$S^{\vec{\tau}}[KB] = \{(u_1, \ldots, u_4) \in \Delta^4 : u_1 \leq 1/3 + \tau_i/3,\ u_3 = u_4 = 0\}.$$ ∎

5. Recall that the closure of a set $X \subseteq \mathbb{R}^K$ consists of all $K$-tuples that are the limit of a sequence of
$K$-tuples in $X$.

6. Note that here we are viewing $KB$ as a formula in $\mathcal{L}^{=}$, under the translation defined earlier; we do this
throughout the paper without further comment.
3.3 The concentration phenomenon

With every world $W \in \mathcal{W}^*$, we can associate a particular tuple $(u_1, \ldots, u_K)$, where $u_i$ is
the fraction of the domain satisfying atom $A_i$ in $W$:

Definition 3.10: Given a world $W \in \mathcal{W}^*$, we define $\pi(W) \in \Delta^K$ to be

$$(\|A_1(x)\|_x, \|A_2(x)\|_x, \ldots, \|A_K(x)\|_x)$$

where the values of the proportions are interpreted over $W$. We say that the vector $\pi(W)$
is the point associated with $W$. ∎

We define the entropy of any model $W$ to be the entropy of $\pi(W)$; that is, if $\pi(W) =
(u_1, \ldots, u_K)$, then the entropy of $W$ is $H(u_1, \ldots, u_K)$. As we are about to show, the entropy
of $\vec{u}$ turns out to be a very good asymptotic indicator of how many worlds $W$ there are such
that $\pi(W) = \vec{u}$. In fact, there are so many more worlds near points of high entropy that
we can ignore all the other points when computing degrees of belief. This concentration
phenomenon, as Jaynes (1982) has called it, is essentially the content of the next lemma
and justifies our interest in the maximum-entropy point(s) of $S^{\vec{\tau}}[KB]$.

For any $S \subseteq \Delta^K$ let $\#\mathit{worlds}^{\vec{\tau}}_N[S](KB)$ denote the number of worlds $W$ of size $N$
such that $(W, \vec{\tau}) \models KB$ and such that $\pi(W) \in S$; for any $\vec{u} \in \Delta^K$ let $\#\mathit{worlds}^{\vec{\tau}}_N[\vec{u}](KB)$
abbreviate $\#\mathit{worlds}^{\vec{\tau}}_N[\{\vec{u}\}](KB)$. Of course $\#\mathit{worlds}^{\vec{\tau}}_N[\vec{u}](KB)$ is necessarily zero unless all
components of $\vec{u}$ are multiples of $1/N$. However, if there are any models associated with $\vec{u}$
at all, we can estimate their number quite accurately using the entropy function:

Lemma 3.11: There exist some function $h : \mathbb{N} \to \mathbb{N}$ and two strictly positive polynomial
functions $f, g : \mathbb{N} \to \mathbb{R}$ such that, for $KB \in \mathcal{L}^{=}_1$ and $\vec{u} \in \Delta^K$, if $\#\mathit{worlds}^{\vec{\tau}}_N[\vec{u}](KB) \neq 0$,
then

$$(h(N)/f(N))\, e^{N H(\vec{u})} \leq \#\mathit{worlds}^{\vec{\tau}}_N[\vec{u}](KB) \leq h(N)\, g(N)\, e^{N H(\vec{u})}.$$

Of course, it follows from the lemma that tuples whose entropy is near maximum have
overwhelmingly more worlds associated with them than tuples whose entropy is further
from maximum. This is essentially the concentration phenomenon.

Lemma 3.11 is actually fairly easy to prove. The following simple example illustrates
the main idea.



Example 3.12: Suppose $\Phi = \{P\}$ and $KB = \mathit{true}$. We have

$$\Delta^K = \Delta^2 = \{(u_1, 1 - u_1) : 0 \leq u_1 \leq 1\},$$

where the atoms are $A_1 = P$ and $A_2 = \neg P$. For any $N$, partition the worlds in $\mathcal{W}_N$
according to the point to which they correspond. For example, the graph in Figure 1 shows
us the partition of $\mathcal{W}_4$. In general, consider some point $\vec{u} = (r/N, (N - r)/N)$. The number
of worlds corresponding to $\vec{u}$ is simply the number of ways of choosing the denotation of
$P$. We need to choose which $r$ elements satisfy $P$; hence, the number of such worlds is
$\binom{N}{r} = \frac{N!}{r!(N-r)!}$. Figure 2 shows the qualitative behavior of this function for large values of
$N$. It is easy to see the asymptotic concentration around $\vec{u} = (0.5, 0.5)$.

Figure 1: Partition of $\mathcal{W}_4$ according to $\pi(W)$.

Figure 2: Concentration phenomenon for worlds of size $N$.

We can estimate the factorials appearing in this expression using Stirling's approximation,
which asserts that the factorial $m!$ is approximately $m^m = e^{m \ln m}$. So, after
substituting for the three factorials, we can estimate $\binom{N}{r}$ as $e^{N \log N - (r \log r + (N-r) \log (N-r))}$,
which reduces to $e^{N H(\vec{u})}$. The entropy term in the general case arises from the use of
Stirling's approximation in an analogous way. (A more careful estimate is done in the proof of
Lemma 3.11 in the appendix.) ∎
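
The estimate in Example 3.12 can be checked numerically. The following sketch is an
illustration added here, with hypothetical helper names `entropy` and `log_worlds`. It compares
$\frac{1}{N} \ln \binom{N}{r}$ with $H(\vec{u})$ at the point $\vec{u} = (0.3, 0.7)$ for growing $N$; the two
quantities approach each other, which is the sense in which the number of worlds at $\vec{u}$
grows like $e^{N H(\vec{u})}$ up to lower-order factors.

```python
import math

def entropy(u):
    """H(u) = -sum u_i ln u_i, taking 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in u if p > 0)

def log_worlds(n, r):
    """Natural log of the number of worlds of size n (vocabulary {P},
    KB = true) in which exactly r domain elements satisfy P."""
    return math.log(math.comb(n, r))

for n in (10, 100, 1000, 10000):
    r = (3 * n) // 10                      # the point u = (0.3, 0.7)
    u = (r / n, (n - r) / n)
    print(n, log_worlds(n, r) / n, entropy(u))

# Ratio of worlds at u = (0.5, 0.5) to worlds at u = (0.3, 0.7) for N = 1000:
# the log-ratio grows linearly in N, so the higher-entropy point dominates.
print(log_worlds(1000, 500) - log_worlds(1000, 300))
```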






Because of the exponential dependence on $N$ times the entropy, the number of worlds
associated with points of high entropy swamps all other worlds as $N$ grows large. This
concentration phenomenon, well-known in the field of statistical physics, forms the basis
for our main result in this section. It asserts that it is possible to compute degrees of
belief according to random worlds while ignoring all but those worlds whose entropy is near
maximum. The next theorem essentially formalizes this phenomenon.

Theorem 3.13: For all sufficiently small $\vec{\tau}$, the following is true. Let $Q$ be the points with
greatest entropy in $S^{\vec{\tau}}[KB]$ and let $O \subseteq \mathbb{R}^K$ be any open set containing $Q$. Then for all
$\theta \in \mathcal{L}^{\approx}$ and for $\lim^* \in \{\limsup, \liminf\}$ we have

$$\lim^*_{N\to\infty} \Pr^{\vec{\tau}}_N(\theta|KB) = \lim^*_{N\to\infty} \frac{\#\mathit{worlds}^{\vec{\tau}}_N[O](\theta \wedge KB)}{\#\mathit{worlds}^{\vec{\tau}}_N[O](KB)}.$$

We remark that this is quite a difficult theorem. We have discussed why Lemma 3.11 lets
us look at models of $KB$ whose entropy is (near) maximum. But the theorem tells us to look
at the maximum-entropy points of $S^{\vec{\tau}}[KB]$, which we defined using a (so far unmotivated)
syntactic procedure applied to $KB$. It seems reasonable to expect that $S^{\vec{\tau}}[KB]$ should tell
us something about models of $KB$. But making this connection precise, and in particular
showing how the maximum-entropy points of $S^{\vec{\tau}}[KB]$ relate to models of $KB$ with
near-maximum entropy, is difficult. However, we defer all details of the proof of that result to
the appendix.

In general, Theorem 3.13 may seem to be of limited usefulness: knowing that we only 
have to look at worlds near the maximum-entropy point does not substantially reduce 
the number of worlds we need to consider. (Indeed, the whole point of the concentration 
phenomenon is that almost all worlds have high entropy.) Nevertheless, as the rest of this 
paper shows, this result can be quite useful when combined with the following two results. 
The first of these says that if all the worlds near the maximum-entropy points have a certain 
property, then we should have degree of belief 1 that this property is true. 

Corollary 3.14: For all sufficiently small $\vec{\tau}$, the following is true. Let $Q$ be the points with
greatest entropy in $S^{\vec{\tau}}[KB]$, let $O \subseteq \mathbb{R}^K$ be an open set containing $Q$, and let $\theta[O] \in \mathcal{L}^{=}$ be
an assertion that holds for every world $W$ such that $\pi(W) \in O$. Then

$$\Pr^{\vec{\tau}}_\infty(\theta[O]|KB) = 1.$$

Example 3.15: For the knowledge base $\mathit{true}$ in Example 3.12, it is easy to see that the
maximum-entropy point is $(0.5, 0.5)$. Fix some arbitrary $\epsilon > 0$. Clearly, there is some open
set $O$ around this point such that the assertion $\theta = \|P(x)\|_x \in [0.5 - \epsilon, 0.5 + \epsilon]$ holds for
every world in $O$. Therefore, we can conclude that

$$\Pr^{\vec{\tau}}_\infty(\|P(x)\|_x \in [0.5 - \epsilon, 0.5 + \epsilon] \mid \mathit{true}) = 1.$$ ∎

As we show in (Bacchus et al., 1994), formulas $\theta$ with degree of belief 1 can essentially
be treated just like other knowledge in $KB$. That is, the degrees of belief relative to $KB$
and $KB \wedge \theta$ will be identical (even if $KB$ and $KB \wedge \theta$ are not logically equivalent). More
formally:

Theorem 3.16: (Bacchus et al., 1994) If $\Pr^{\vec{\tau}}_\infty(\theta|KB) = 1$ and $\lim^* \in \{\limsup, \liminf\}$,
then for any formula $\varphi$:

$$\lim^*_{N\to\infty} \Pr^{\vec{\tau}}_N(\varphi|KB) = \lim^*_{N\to\infty} \Pr^{\vec{\tau}}_N(\varphi|KB \wedge \theta).$$

Proof: For completeness, we repeat the proof from (Bacchus et al., 1994) here. Basic
probabilistic reasoning shows that, for any $N$ and $\vec{\tau}$:

$$\Pr^{\vec{\tau}}_N(\varphi|KB) = \Pr^{\vec{\tau}}_N(\varphi|KB \wedge \theta)\, \Pr^{\vec{\tau}}_N(\theta|KB) + \Pr^{\vec{\tau}}_N(\varphi|KB \wedge \neg\theta)\, \Pr^{\vec{\tau}}_N(\neg\theta|KB).$$

By assumption, $\Pr^{\vec{\tau}}_N(\theta|KB)$ tends to 1 when we take limits, so the first term tends to
$\Pr^{\vec{\tau}}_N(\varphi|KB \wedge \theta)$. On the other hand, $\Pr^{\vec{\tau}}_N(\neg\theta|KB)$ has limit 0. Because $\Pr^{\vec{\tau}}_N(\varphi|KB \wedge \neg\theta)$ is
bounded, we conclude that the second product also tends to 0. The result follows. ∎

As we shall see in the next section, the combination of Corollary 3.14 and Theorem 3.16 
is quite powerful. 

4. Computing degrees of belief 

Although the concentration phenomenon is interesting, its application to actually computing 
degrees of belief may not be obvious. Since we know that almost all worlds will have high 
entropy, a direct application of Theorem 3.13 does not substantially reduce the number of 
worlds we must consider. Yet, as we show in this section, the concentration theorem can 
form the basis of a practical technique for computing degrees of belief in many cases. We 
begin in Section 4.1 by presenting the intuitions underlying this technique. In Section 4.2 
we build on these intuitions by presenting results for a restricted class of formulas: those 
queries which are quantifier-free formulas over a unary language with a single constant 
symbol. In spite of this restriction, many of the issues arising in the general case can be 
seen here. Moreover, as we show in Section 4.3, this restricted sublanguage is rich enough 
to allow us to embed two well-known propositional approaches that make use of maximum 
entropy: Nilsson's probabilistic logic (Nilsson, 1986) and the maximum-entropy extension
of $\epsilon$-semantics (Geffner & Pearl, 1990) due to Goldszmidt, Morris, and Pearl (1990) (see also
(Goldszmidt, Morris, & Pearl, 1993)). In Section 4.4, we consider whether the results for
the restricted language can be extended. We show that they can, but several difficult and 
subtle issues arise. 

4.1 The general strategy 

Although the random-worlds method is defined by counting worlds, we can sometimes find
more direct ways to calculate the degrees of belief it yields. In (Bacchus et al., 1994) we
present a number of such techniques, most of which apply only in very special cases. One
of the simplest and most intuitive is the following version of what philosophers have termed
direct inference (Reichenbach, 1949). Suppose that all we know about an individual $c$ is
some assertion $\psi(c)$; in other words, $KB$ has the form $\psi(c) \wedge KB'$, and the constant $c$ does
not appear in $KB'$. Also suppose that $KB$, together with a particular tolerance $\vec{\tau}$, implies
that $\|\varphi(x)|\psi(x)\|_x$ is in some interval $[\alpha, \beta]$. It seems reasonable to argue that $c$ should
be treated as a "typical" element satisfying $\psi(x)$, because by assumption $KB$ contains no
information suggesting otherwise. Therefore, we might hope to use the statistics directly,
and conclude that $\Pr^{\vec{\tau}}_\infty(\varphi(c)|KB) \in [\alpha, \beta]$. This is indeed the case, as the following theorem
shows.

Theorem 4.1: (Bacchus et al., 1994) Let $KB$ be a knowledge base of the form $\psi(\vec{c}) \wedge KB'$,
and assume that for all sufficiently small tolerance vectors $\vec{\tau}$,

$$KB[\vec{\tau}] \models \|\varphi(\vec{x})|\psi(\vec{x})\|_{\vec{x}} \in [\alpha, \beta].$$

If no constant in $\vec{c}$ appears in $KB'$, in $\varphi(\vec{x})$, or in $\psi(\vec{x})$, then $\Pr_\infty(\varphi(\vec{c})|KB) \in [\alpha, \beta]$ (if
the degree of belief exists at all).
This result, in combination with the results of the previous section, provides us with a 
very powerful tool. Roughly speaking, we propose to use the following strategy: The basic 
concentration phenomenon says that most worlds are very similar in a certain sense. As 
shown in Corollary 3.14, we can use this to find some assertions that are "almost certainly" 
true (i.e., with degree of belief 1) even if they are not logically implied by KB. Theorem 3.16 
then tells us that we can treat these new assertions as if they are in fact known with 
certainty. When these new assertions state statistical "knowledge", they can vastly increase 
our opportunities to apply direct inference. The following example illustrates this idea. 

Example 4.2: Consider a very simple knowledge base over a vocabulary containing a
single unary predicate, $\mathcal{P} = \{P\}$:

$$KB = (\|P(x)\|_x \preceq_1 0.3).$$

There are two atoms $A_1$ and $A_2$ over $\mathcal{P}$, with $A_1 = P$ and $A_2 = \neg P$. The solution space of
this $KB$ given $\vec{\tau}$ is clearly

$$S^{\vec{\tau}}[KB] = \{(u_1, u_2) \in \Delta^2 : u_1 \leq 0.3 + \tau_1\}.$$

A straightforward computation shows that, for $\tau_1 < 0.2$, this has a unique maximum-entropy
point $\vec{v} = (0.3 + \tau_1, 0.7 - \tau_1)$.

Now, consider the query $P(c)$. For all $\epsilon > 0$, let $\theta[\epsilon]$ be the formula $\|P(x)\|_x \in [(0.3 +
\tau_1) - \epsilon, (0.3 + \tau_1) + \epsilon]$. This satisfies the condition of Corollary 3.14, so it follows that
$\Pr^{\vec{\tau}}_\infty(\theta[\epsilon]|KB) = 1$. Using Theorem 3.16, we know that for $\lim^* \in \{\liminf, \limsup\}$,

$$\lim^*_{N\to\infty} \Pr^{\vec{\tau}}_N(P(c)|KB) = \lim^*_{N\to\infty} \Pr^{\vec{\tau}}_N(P(c)|KB \wedge \theta[\epsilon]).$$

But now we can use direct inference. (Note that here, our "knowledge" about $c$ is vacuous,
i.e., "$\mathit{true}(c)$".) We conclude that, if there is any limit at all, then necessarily

$$\Pr^{\vec{\tau}}_\infty(P(c)|KB \wedge \theta[\epsilon]) \in [(0.3 + \tau_1) - \epsilon, (0.3 + \tau_1) + \epsilon].$$

So, for all $\epsilon > 0$,

$$\Pr^{\vec{\tau}}_\infty(P(c)|KB) \in [(0.3 + \tau_1) - \epsilon, (0.3 + \tau_1) + \epsilon].$$

Since this is true for all $\epsilon$, the only possible value for $\Pr^{\vec{\tau}}_\infty(P(c)|KB)$ is $0.3 + \tau_1$, which is the
value of $u_1$ (i.e., $\|P(x)\|_x$) at the maximum-entropy point. Note that it is also clear what
happens as $\vec{\tau}$ tends to $\vec{0}$: $\Pr_\infty(P(c)|KB)$ is 0.3. ∎
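
The value $0.3 + \tau_1$ can also be approached by counting worlds directly for finite $N$. The
sketch below is an illustrative check rather than part of the formal development. The function
name `degree_of_belief` is hypothetical, and the approximate inequality is assumed to be read as
$\|P(x)\|_x \leq 0.3 + \tau_1$. A world of size $N$ fixes the denotation of $P$ and of the constant
$c$, so there are $\binom{N}{r} N$ worlds in which exactly $r$ elements satisfy $P$, and $P(c)$ holds
in $\binom{N}{r} r$ of them.

```python
from fractions import Fraction
from math import comb

def degree_of_belief(n, tau, bound=Fraction(3, 10)):
    """Pr_N^tau(P(c) | KB) for KB = (||P(x)||_x <=~ bound), by exact counting.

    Worlds are grouped by r, the number of domain elements satisfying P.
    Exact rational arithmetic avoids rounding trouble at the boundary
    r/n = bound + tau.
    """
    satisfying_query = 0
    satisfying_kb = 0
    for r in range(n + 1):
        if Fraction(r, n) <= bound + tau:   # assumed reading of the KB
            count = comb(n, r)
            satisfying_kb += count * n      # n choices for the denotation of c
            satisfying_query += count * r   # c must denote one of the r elements in P
    return Fraction(satisfying_query, satisfying_kb)

tau = Fraction(1, 20)
for n in (10, 100, 1000):
    print(n, float(degree_of_belief(n, tau)))
```

For $\tau_1 = 0.05$ the printed values move towards 0.35 as $N$ grows, in line with the
maximum-entropy point $(0.3 + \tau_1, 0.7 - \tau_1)$.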
This example demonstrates the main steps of one possible strategy for computing degrees
of belief. First the maximum-entropy points of the space $S^{\vec{\tau}}[KB]$ are computed as a function
of $\vec{\tau}$. Then, these are used to compute $\Pr^{\vec{\tau}}_\infty(\varphi|KB)$, assuming the limit exists (if not, the
lim sup and lim inf of $\Pr^{\vec{\tau}}_N(\varphi|KB)$ are computed instead). Finally, we compute the limit of
this probability as $\vec{\tau}$ goes to zero.

Unfortunately, this strategy has a serious potential problem. We clearly cannot compute
$\Pr^{\vec{\tau}}_\infty(\varphi|KB)$ separately for each of the infinitely many tolerance vectors $\vec{\tau}$ and then take the
limit as $\vec{\tau}$ goes to $\vec{0}$. We might hope to compute this probability as an explicit function of
$\vec{\tau}$, and then compute the limit. For instance, in Example 4.2 $\Pr^{\vec{\tau}}_\infty(P(c)|KB)$ was found to
be $0.3 + \tau_1$, and so it is easy to see what happens as $\tau_1 \to 0$. But there is no reason to
believe that $\Pr^{\vec{\tau}}_\infty(\varphi|KB)$ is, in general, an easily characterizable function of $\vec{\tau}$. If it is not,
then computing the limit as $\vec{\tau}$ goes to $\vec{0}$ can be difficult or impossible. We would like to
find a way to avoid this explicit limiting process altogether. It turns out that this is indeed
possible in some circumstances. The main requirement is that the maximum-entropy points
of $S^{\vec{\tau}}[KB]$ converge to the maximum-entropy points of $S^{\vec{0}}[KB]$. (For future reference, notice
that $S^{\vec{0}}[KB]$ is the closure of the solution space of the constraints obtained from $KB$ by
replacing all occurrences of $\approx_i$ by $=$ and all occurrences of $\preceq_i$ by $\leq$.) In many such cases,
we can compute $\Pr_\infty(\varphi|KB)$ directly in terms of the maximum-entropy points of $S^{\vec{0}}[KB]$,
without taking limits at all.

As the following example shows, this type of continuity does not hold in general: the
maximum-entropy points of $S^{\vec{\tau}}[KB]$ do not necessarily converge to those of $S^{\vec{0}}[KB]$.

Example 4.3: Consider the knowledge base

$$KB = (\|P(x)\|_x \approx_1 0.3 \vee \|P(x)\|_x \approx_2 0.4) \wedge \|P(x)\|_x \not\approx_3 0.4.$$

It is easy to see that $S^{\vec{0}}[KB]$ is just $\{(0.3, 0.7)\}$: The point $(0.4, 0.6)$ is disallowed by the
second conjunct. Now, consider $S^{\vec{\tau}}[KB]$ for $\vec{\tau} > \vec{0}$. If $\tau_2 \leq \tau_3$, then $S^{\vec{\tau}}[KB]$ indeed does
not contain points where $u_1$ is near 0.4; the maximum-entropy point of this space is easily
seen to be at $u_1 = 0.3 + \tau_1$. However, if $\tau_2 > \tau_3$ then there will be points in $S^{\vec{\tau}}[KB]$ where $u_1$ is
around 0.4; for instance, those where $0.4 + \tau_3 < u_1 \leq 0.4 + \tau_2$. Since these points have
a higher entropy than the points in the vicinity of 0.3, the former will dominate. Thus,
the set of maximum-entropy points of $S^{\vec{\tau}}[KB]$ does not converge to a single well-defined
set. What it converges to (if anything) depends on how $\vec{\tau}$ goes to $\vec{0}$. This nonconvergence
has consequences for degrees of belief. It is not hard to show that $\Pr^{\vec{\tau}}_\infty(P(c)|KB)$ can be either
$0.3 + \tau_1$ or $0.4 + \tau_2$, depending on the precise relationship between $\tau_1$, $\tau_2$, and $\tau_3$. It follows
that $\Pr_\infty(P(c)|KB)$ does not exist. ∎
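
The dependence on the relative size of the tolerances can be seen in a small numerical
experiment. The sketch below is illustrative only; the function names are hypothetical, the
approximate equality $\approx_i$ is assumed to be read as a distance of at most $\tau_i$, and its negation
as a distance greater than $\tau_i$. It locates the maximum-entropy value of $u_1$ over a grid
approximation of the solution space for two choices of the tolerance vector.

```python
import math

def entropy2(u1):
    """Entropy of the point (u1, 1 - u1), taking 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in (u1, 1 - u1) if p > 0)

def in_solution_space(u1, t1, t2, t3):
    """Constraints of Example 4.3 under the assumed reading of the tolerances."""
    near_03 = abs(u1 - 0.3) <= t1
    near_04 = abs(u1 - 0.4) <= t2
    not_near_04 = abs(u1 - 0.4) > t3
    return (near_03 or near_04) and not_near_04

def max_entropy_u1(t1, t2, t3, steps=100000):
    """Grid approximation of the u1 coordinate with greatest entropy."""
    feasible = [k / steps for k in range(steps + 1)
                if in_solution_space(k / steps, t1, t2, t3)]
    return max(feasible, key=entropy2)

# tau_2 <= tau_3: nothing near 0.4 survives, so the optimum is near 0.3 + tau_1.
print(max_entropy_u1(0.01, 0.01, 0.02))
# tau_2 > tau_3: points just above 0.4 + tau_3 survive and win, near 0.4 + tau_2.
print(max_entropy_u1(0.01, 0.02, 0.01))
```

Both tolerance vectors are small, yet the maximizer jumps between roughly 0.31 and 0.42, which
is the nonconvergence described in the example.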

We say that a degree of belief $\Pr_\infty(\varphi|KB)$ is not robust if the behavior of $\Pr^{\vec{\tau}}_\infty(\varphi|KB)$ (or of
$\liminf_{N\to\infty} \Pr^{\vec{\tau}}_N(\varphi|KB)$ and $\limsup_{N\to\infty} \Pr^{\vec{\tau}}_N(\varphi|KB)$) as $\vec{\tau}$ goes to $\vec{0}$ depends on how $\vec{\tau}$ goes to $\vec{0}$. In
other words, nonrobustness describes situations when $\Pr_\infty(\varphi|KB)$ does not exist because
of sensitivity to the exact choice of tolerances. We shall see a number of other examples of
nonrobustness in later sections.

It might seem that the notion of robustness is an artifact of our approach. In particular,
it seems to depend on the fact that our language has the expressive power to say that the two
tolerances represent a different degree of approximation, simply by using different subscripts
($\approx_2$ vs. $\not\approx_3$ in the example). In an approach to representing approximate equality that does
not make these distinctions, we are bound to get the answer 0.3 in the example above, since
then $\|P(x)\|_x \not\approx_3 0.4$ really would be the negation of $\|P(x)\|_x \approx_2 0.4$. We would argue
that the answer 0.3 is not as reasonable as it might at first seem. Suppose one of the two
different instances of 0.4 in the previous example had been slightly different; for example,
suppose we had used 0.399 rather than 0.4 in the first of them. In this case, the second
conjunct is essentially vacuous, and can be ignored. The maximum-entropy point in $S^{\vec{0}}[KB]$
is now 0.399, and we indeed derive a degree of belief of 0.399 in $P(c)$. Thus, arbitrarily small
changes to the numbers in the original knowledge base can cause large changes in our degrees
of belief. But these numbers are almost always the result of approximate observations; this
is reflected by our decision to use approximate equality rather than equality when referring
to them. It does not seem reasonable to base actions on a degree of belief that can change
so drastically in the face of small changes in the measurement of data. Note that, if we
know that the two instances of 0.4 do, in fact, denote exactly the same number, we can
represent this by using the same approximate equality connective in both disjuncts. In this
case, it is easy to see that we do get the answer 0.3.

A close look at the example shows that the nonrobustness arises because of the negated
proportion expression $\|P(x)\|_x \not\approx_3 0.4$. Indeed, we can show that if we start with a KB
in canonical form that does not contain negated proportion expressions then, in a precise
sense, the set of maximum-entropy points of $S^{\vec{\tau}}[KB]$ necessarily converges to the set of
maximum-entropy points of $S^{\vec{0}}[KB]$. An argument can be made that we should eliminate
negated proportion expressions from the language altogether. It is one thing to argue 
that sometimes we have statistical values whose accuracy we are unsure about, so that we 
want to make logical assertions less stringent than exact numerical equality. It is harder 
to think of cases in which the opposite is true, and all we know is that some statistic is 
"not even approximately" equal to some value. However, we do not eliminate negated 
proportion expressions from the language, since without them we would not be able to 
prove an analogue to Theorem 3.5. (They arise when we try to flatten nested proportion 
expressions, for example.) Instead, we have identified a weaker condition that is sufficient 
to prevent problems such as that seen in Example 4.3. Essential positivity simply tests that 
negations are not interacting with the maximum-entropy computation in a harmful way. 

Definition 4.4: Let $\Gamma^{\le}(KB[\vec{0}])$ be the result of replacing each strict inequality in $\Gamma(KB[\vec{0}])$
with its weakened version. More formally, we replace each subformula of the form $t < 0$
with $t \le 0$, and each subformula of the form $t > 0$ with $t \ge 0$. (Recall that these are the
only constraints possible in $\Gamma(KB[\vec{0}])$, since all tolerance variables $\varepsilon_i$ are assigned 0.) Let
$S^{\le\vec{0}}[KB]$ be $\overline{Sol[\Gamma^{\le}(KB[\vec{0}])]}$, where we use $\overline{X}$ to denote the closure of $X$. We say that KB
is essentially positive if the sets $S^{\le\vec{0}}[KB]$ and $S^{\vec{0}}[KB]$ have the same maximum-entropy
points. ■

Example 4.5: Consider again the knowledge base KB from Example 4.3. The constraint
formula $\Gamma(KB[\vec{0}])$ is (after simplification):

$$(u_1 = 0.3 \vee u_1 = 0.4) \wedge (u_1 < 0.4 \vee u_1 > 0.4).$$

Its "weakened" version is $\Gamma^{\le}(KB[\vec{0}])$:

$$(u_1 = 0.3 \vee u_1 = 0.4) \wedge (u_1 \le 0.4 \vee u_1 \ge 0.4),$$

which is clearly equivalent to $u_1 = 0.3 \vee u_1 = 0.4$. Thus, $S^{\vec{0}}[KB] = \{(u_1, u_2) \in \Delta^2 :
u_1 = 0.3\}$ whereas $S^{\le\vec{0}}[KB] = S^{\vec{0}}[KB] \cup \{(0.4, 0.6)\}$. Since the two spaces have different
maximum-entropy points, the knowledge base KB is not essentially positive. ■
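
The comparison in Example 4.5 can be reproduced with a few lines of Python. This is only a sketch under simplifying assumptions: a point of $\Delta^2$ is represented by its first coordinate, both solution sets are finite (so taking closures changes nothing), and the helper names `entropy` and `argmax_entropy` are ours, not notation from the text.

```python
import math

def entropy(u1):
    """Entropy of the point (u1, 1 - u1) of the simplex Delta^2."""
    return -sum(p * math.log(p) for p in (u1, 1.0 - u1) if p > 0)

def argmax_entropy(points):
    """Return the points of a finite candidate list that have maximal entropy."""
    best = max(entropy(p) for p in points)
    return [p for p in points if math.isclose(entropy(p), best)]

# The first conjunct (u1 = 0.3 or u1 = 0.4) leaves only two candidate values.
candidates = (0.3, 0.4)

# Second conjunct of the constraint formula, and of its weakened version.
strict_solutions = [u for u in candidates if (u < 0.4 or u > 0.4)]
weak_solutions = [u for u in candidates if (u <= 0.4 or u >= 0.4)]

strict_max = argmax_entropy(strict_solutions)
weak_max = argmax_entropy(weak_solutions)

# KB is essentially positive only if both spaces share their maximum-entropy points.
essentially_positive = strict_max == weak_max
print(strict_max, weak_max, essentially_positive)
```

The weakened space gains the point with $u_1 = 0.4$, which has higher entropy than the point with $u_1 = 0.3$, so the two sets of maximum-entropy points differ.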

As the following result shows, essential positivity suffices to guarantee that the
maximum-entropy points of $S^{\vec{\tau}}[KB]$ converge to those of $S^{\vec{0}}[KB]$.

Proposition 4.6: Assume that KB is essentially positive and let $Q$ be the set of
maximum-entropy points of $S^{\vec{0}}[KB]$ (and thus also of $S^{\le\vec{0}}[KB]$). Then for all $\epsilon > 0$ and all
sufficiently small tolerance vectors $\vec{\tau}$ (where "sufficiently small" may depend on $\epsilon$), every
maximum-entropy point of $S^{\vec{\tau}}[KB]$ is within $\epsilon$ of some maximum-entropy point in $Q$.

4.2 Queries for a single individual 

We now show how to compute $\Pr_\infty(\varphi \mid KB)$ for a certain restricted class of first-order
formulas $\varphi$ and knowledge bases KB. The most significant restriction is that the query $\varphi$
should be a quantifier-free (first-order) sentence over the vocabulary $\mathcal{P} \cup \{c\}$; thus, it is a
query about a single individual, $c$. While this class is rather restrictive, it suffices to express
many real-life examples. Moreover, it is significantly richer than the language considered
by Paris and Vencovska (1989).

The following definition helps define the class of interest. 

Definition 4.7: A formula is essentially propositional if it is a quantifier-free and
proportion-free formula in the language $\mathcal{L}^{\approx}(\{P_1, \ldots, P_k\})$ (so that, in particular, it has no
constant symbols) and has only one free variable $x$. ■

We say that $\varphi(c)$ is a simple query for KB if:

• $\varphi(x)$ is essentially propositional,

• KB is of the form $\psi(c) \wedge KB'$, where $\psi(x)$ is essentially propositional and $KB'$ does
not mention $c$.

Thus, just as in Theorem 4.1, we suppose that $\psi(c)$ summarizes all that is known about
$c$. In this section, we focus on computing the degree of belief $\Pr_\infty(\varphi(c) \mid KB)$ for a formula
$\varphi(c)$ which is a simple query for KB.

Note that an essentially propositional formula $\xi(x)$ is equivalent to a disjunction of
atoms. For example, over the vocabulary $\{P_1, P_2\}$, the formula $P_1(x) \vee P_2(x)$ is equivalent
to $A_1(x) \vee A_2(x) \vee A_3(x)$ (where the atoms are ordered as in Example 3.2). For an essentially
propositional formula $\xi$, we take $\mathcal{A}(\xi)$ to be the (unique) set of atoms such that $\xi$ is equivalent
to $\bigvee_{A_j \in \mathcal{A}(\xi)} A_j(x)$.

If we view a tuple $u \in \Delta^K$ as a probability assignment to the atoms, we can extend $u$ to
a probability assignment on all essentially propositional formulas using this identification
of an essentially propositional formula with a set of atoms:

Definition 4.8: Let $\xi$ be an essentially propositional formula. We define a function
$F_{[\xi]} : \Delta^K \to \mathbb{R}$ as follows:

$$F_{[\xi]}(u) = \sum_{A_j \in \mathcal{A}(\xi)} u_j.$$

For essentially propositional formulas $\varphi(x)$ and $\psi(x)$ we define the (partial) function
$F_{[\varphi \mid \psi]} : \Delta^K \to \mathbb{R}$ to be:

$$F_{[\varphi \mid \psi]}(u) = \frac{F_{[\varphi \wedge \psi]}(u)}{F_{[\psi]}(u)}.$$

Note that this function is undefined when $F_{[\psi]}(u) = 0$. ■
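
Definition 4.8 translates almost directly into code. The following Python sketch assumes that an essentially propositional formula is given as a Boolean function of the truth values of $P_1, \ldots, P_k$, and that atoms are enumerated with all-positive first (for $k = 2$ this matches the order $A_1 = P_1 \wedge P_2$, $A_2 = P_1 \wedge \neg P_2$, and so on). The function names are illustrative only.

```python
from itertools import product

def atoms(k):
    """All 2^k atoms over P_1..P_k as tuples of truth values (assumed ordering)."""
    return list(product([True, False], repeat=k))

def F(xi, u, k):
    """F_[xi](u): total weight that u assigns to the atoms on which xi holds."""
    return sum(weight for atom, weight in zip(atoms(k), u) if xi(*atom))

def F_cond(phi, psi, u, k):
    """F_[phi|psi](u); returns None where the partial function is undefined."""
    denominator = F(psi, u, k)
    if denominator == 0:
        return None
    return F(lambda *atom: phi(*atom) and psi(*atom), u, k) / denominator

# Usage: a point of Delta^4 and two formulas over {P_1, P_2}.
u = (0.1, 0.2, 0.3, 0.4)
print(F(lambda p1, p2: p1 or p2, u, 2))                    # weight of A_1, A_2, A_3
print(F_cond(lambda p1, p2: p1, lambda p1, p2: p2, u, 2))  # F_[P_1 | P_2](u)
```

Returning `None` for a zero denominator mirrors the fact that $F_{[\varphi \mid \psi]}$ is only a partial function.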

As the following result shows, if $\varphi$ is a simple query for KB (of the form $\psi(c) \wedge KB'$), then
all that matters in computing $\Pr_\infty(\varphi \mid KB)$ is $F_{[\varphi \mid \psi]}(u)$ for tuples $u$ of maximum entropy.
Thus, in a sense, we are only using $KB'$ to determine the space over which we maximize
entropy. Having defined this space, we can focus on $\psi$ and $\varphi$ in determining the degree of
belief.

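Theorem 4.9: Suppose $\varphi(c)$ is a simple query for KB. For all $\vec{\tau}$ sufficiently small, if $Q$
is the set of maximum-entropy points in $S^{\vec{\tau}}[KB]$ and $F_{[\psi]}(v) > 0$ for all $v \in Q$, then for
$\lim^* \in \{\limsup, \liminf\}$ we have

$$\lim^*_{N \to \infty} \Pr^{\vec{\tau}}_N(\varphi(c) \mid KB) \in \left[ \inf_{v \in Q} F_{[\varphi \mid \psi]}(v),\ \sup_{v \in Q} F_{[\varphi \mid \psi]}(v) \right].$$
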

The following is an immediate but important corollary of this theorem. It asserts that, if
the space $S^{\vec{\tau}}[KB]$ has a unique maximum-entropy point, then its value uniquely determines
the probability $\Pr^{\vec{\tau}}_\infty(\varphi(c) \mid KB)$.

Corollary 4.10: Suppose $\varphi(c)$ is a simple query for KB. For all $\vec{\tau}$ sufficiently small, if $v$
is the unique maximum-entropy point in $S^{\vec{\tau}}[KB]$ and $F_{[\psi]}(v) > 0$, then

$$\Pr^{\vec{\tau}}_\infty(\varphi(c) \mid KB) = F_{[\varphi \mid \psi]}(v).$$

We are interested in $\Pr_\infty(\varphi(c) \mid KB)$, which means that we are interested in the limit of
$\Pr^{\vec{\tau}}_\infty(\varphi(c) \mid KB)$ as $\vec{\tau} \to \vec{0}$. Suppose KB is essentially positive. Then, by the results of the
previous section and the continuity of $F_{[\varphi \mid \psi]}$, it is enough to look directly at the
maximum-entropy points of $S^{\vec{0}}[KB]$. More formally, by combining Theorem 4.9 with Proposition 4.6,
we can show:

Theorem 4.11: Suppose $\varphi(c)$ is a simple query for KB. If the space $S^{\vec{0}}[KB]$ has a unique
maximum-entropy point $v$, KB is essentially positive, and $F_{[\psi]}(v) > 0$, then

$$\Pr_\infty(\varphi(c) \mid KB) = F_{[\varphi \mid \psi]}(v).$$

We believe that this theorem will turn out to cover a lot of cases that occur in practice.
As our examples and the discussion in the next section show, we often do get simple queries
and knowledge bases that are essentially positive. Concerning the assumption of a unique
maximum-entropy point, note that the entropy function is strictly concave and so this assumption
is automatically satisfied if $S^{\vec{0}}[KB]$ is a convex space. Recall that a space $S$ is convex if
for all $u, u' \in S$, and all $\alpha \in [0, 1]$, it is also the case that $\alpha u + (1 - \alpha)u' \in S$. The space
$S^{\vec{0}}[KB]$ is surely convex if it is defined using a conjunction of linear constraints. While it
is clearly possible to create knowledge bases where $S^{\vec{0}}[KB]$ has multiple maximum-entropy
points (for example, using disjunctions), we expect that such knowledge bases arise rarely in
practical applications. Perhaps the most restrictive assumption made by this theorem is the
seemingly innocuous requirement that $F_{[\psi]}(v) > 0$. This assumption is obviously necessary
for the theorem to hold; without it, the function $F_{[\varphi \mid \psi]}$ is simply not defined. Unfortunately,
we show in Section 4.4 that this requirement is, in fact, a severe one; in particular, it prevents
the theorem from being applied to most examples derived from default reasoning, using our
statistical interpretation of defaults (Bacchus et al., 1994).

We close this subsection with an example of the theorem in action. 

Example 4.12: Let the language consist of $\mathcal{P}$ = {Hepatitis, Jaundice, BlueEyed} and the
constant Eric. There are eight atoms in this language. We use $A_{P_1' P_2' P_3'}$ to denote the atom
$P_1'(x) \wedge P_2'(x) \wedge P_3'(x)$, where $P_1'$ is either $H$ (denoting Hepatitis) or $\overline{H}$ (denoting $\neg$Hepatitis),
$P_2'$ is $J$ or $\overline{J}$ (for Jaundice and $\neg$Jaundice, respectively), and $P_3'$ is $B$ or $\overline{B}$ (for BlueEyed
and $\neg$BlueEyed, respectively).

Consider the knowledge base $KB_{hep}$:

$\forall x\, (Hepatitis(x) \Rightarrow Jaundice(x)) \ \wedge$
$\|Hepatitis(x) \mid Jaundice(x)\|_x \approx_1 0.8 \ \wedge$
$\|BlueEyed(x)\|_x \approx_2 0.25 \ \wedge$
$Jaundice(Eric).$

If we order the atoms as $A_{HJB}, A_{HJ\overline{B}}, A_{H\overline{J}B}, A_{H\overline{J}\,\overline{B}}, A_{\overline{H}JB}, A_{\overline{H}J\overline{B}}, A_{\overline{H}\,\overline{J}B}, A_{\overline{H}\,\overline{J}\,\overline{B}}$, then
it is not hard to show that $\Gamma(KB_{hep})$ is:

$u_3 = 0 \ \wedge$
$u_4 = 0 \ \wedge$
$(u_1 + u_2) \le (0.8 + \varepsilon_1)(u_1 + u_2 + u_5 + u_6) \ \wedge$
$(u_1 + u_2) \ge (0.8 - \varepsilon_1)(u_1 + u_2 + u_5 + u_6) \ \wedge$
$(u_1 + u_3 + u_5 + u_7) \le (0.25 + \varepsilon_2) \ \wedge$
$(u_1 + u_3 + u_5 + u_7) \ge (0.25 - \varepsilon_2) \ \wedge$
$(u_1 + u_2 + u_5 + u_6) > 0.$

To find the space $S^{\vec{0}}[KB_{hep}]$ we simply set $\varepsilon_1 = \varepsilon_2 = 0$. Then it is quite straightforward to
find the maximum-entropy point in this space, which, taking $\gamma = 2^{1.6}$, is:

$$(v_1, v_2, v_3, v_4, v_5, v_6, v_7, v_8) = \left( \frac{1}{5+\gamma}, \frac{3}{5+\gamma}, 0, 0, \frac{1}{4(5+\gamma)}, \frac{3}{4(5+\gamma)}, \frac{\gamma}{4(5+\gamma)}, \frac{3\gamma}{4(5+\gamma)} \right).$$

Using $v$, we can compute various asymptotic probabilities very easily. For example,

$$\Pr_\infty(Hepatitis(Eric) \mid KB_{hep}) = F_{[Hepatitis \mid Jaundice]}(v) = \frac{v_1 + v_2}{v_1 + v_2 + v_5 + v_6} = \frac{\frac{1}{5+\gamma} + \frac{3}{5+\gamma}}{\frac{1}{5+\gamma} + \frac{3}{5+\gamma} + \frac{1}{4(5+\gamma)} + \frac{3}{4(5+\gamma)}} = 0.8,$$

as expected. Similarly, we can show that $\Pr_\infty(BlueEyed(Eric) \mid KB_{hep}) = 0.25$ and that
$\Pr_\infty(BlueEyed(Eric) \wedge Hepatitis(Eric) \mid KB_{hep}) = 0.2$. Note that the first two answers also
follow from the direct inference principle (Theorem 4.1), which happens to be applicable
in this case. The third answer shows that BlueEyed and Hepatitis are being treated as
independent. It is a special case of a more general independence phenomenon that applies
to random worlds; see (Bacchus et al., 1994, Theorem 5.27). ■
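
The numbers in Example 4.12 can be checked numerically. The Python sketch below is not the derivation used in the text: it assumes (as maximizing entropy under the BlueEyed constraint implies) that BlueEyed is split 1:3 inside every cell, so that the only free parameter is the total Jaundice mass $j$, and it then confirms on a grid that $j = 5/(5 + \gamma)$ is the maximizer. The names `hep_point`, `j_star`, and `GAMMA` are ours.

```python
import math

GAMMA = 2 ** 1.6  # the constant called gamma in Example 4.12

def hep_point(j):
    """Point of S^0[KB_hep] whose total Jaundice mass is j.

    Assumption of this sketch: BlueEyed is split 1:3 inside every cell.
    Atom order: HJB, HJb, HjB, Hjb, hJB, hJb, hjB, hjb (lower case = negated).
    """
    hep, no_hep, rest = 0.8 * j, 0.2 * j, 1.0 - j
    return (hep / 4, 3 * hep / 4, 0.0, 0.0,
            no_hep / 4, 3 * no_hep / 4, rest / 4, 3 * rest / 4)

def entropy(u):
    """Entropy of a probability vector, with 0 log 0 taken to be 0."""
    return -sum(p * math.log(p) for p in u if p > 0)

j_star = 5 / (5 + GAMMA)
v = hep_point(j_star)

# No grid value of j should beat j_star (up to rounding).
grid = [i / 1000 for i in range(1, 1000)]
is_max_on_grid = all(entropy(hep_point(j)) <= entropy(v) + 1e-12 for j in grid)

jaundice = v[0] + v[1] + v[4] + v[5]          # F_[Jaundice](v)
pr_hepatitis = (v[0] + v[1]) / jaundice       # F_[Hepatitis | Jaundice](v)
pr_blue_eyed = (v[0] + v[4]) / jaundice       # F_[BlueEyed | Jaundice](v)
pr_both = v[0] / jaundice                     # F_[BlueEyed and Hepatitis | Jaundice](v)
print(is_max_on_grid, pr_hepatitis, pr_blue_eyed, pr_both)
```

Setting the derivative of the entropy with respect to $j$ to zero gives $(1 - j)/j = 0.8^{0.8} \cdot 0.2^{0.2} = 2^{1.6}/5$, which is where $\gamma$ comes from.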

4.3 Probabilistic propositional logic 

In this section we consider two variants of probabilistic propositional logic. As the following 
discussion shows, both can easily be captured by our framework. The embedding we discuss 
uses simple queries throughout, allowing us to appeal to the results of the previous section. 

Nilsson (1986) considered the problem of what could be inferred about the
probability of certain propositions given some constraints. For example, we might know that
$\Pr(fly \mid bird) \ge 0.7$ and that $\Pr(yellow) \le 0.2$, and be interested in $\Pr(fly \mid bird \wedge yellow)$.
Roughly speaking, Nilsson suggests computing this by considering all probability
distributions consistent with the constraints, and then computing the range of values given to
$\Pr(fly \mid bird \wedge yellow)$ by these distributions. Formally, suppose our language consists of $k$
primitive propositions, $p_1, \ldots, p_k$. Consider the set $\Omega$ of $K = 2^k$ truth assignments to these
propositions. We give semantics to probabilistic statements over this language in terms of
a probability distribution $\mu$ over the set $\Omega$ (see (Fagin, Halpern, & Megiddo, 1990) for
details). Since each truth assignment $\omega \in \Omega$ determines the truth value of every propositional
formula $\beta$, we can determine the probability of every such formula:

$$\Pr_\mu(\beta) = \sum_{\omega \models \beta} \mu(\omega).$$

Clearly, we can determine whether a probability distribution $\mu$ satisfies a set $\Lambda$ of
probabilistic constraints. The standard notion of probabilistic propositional inference would say
that $\Lambda \models \Pr(\beta) \in [\lambda_1, \lambda_2]$ if $\Pr_\mu(\beta)$ is within the range $[\lambda_1, \lambda_2]$ for every distribution $\mu$ that
satisfies the constraints in $\Lambda$.

Unfortunately, while this is a very natural definition, the constraints that one can derive
from it are typically quite weak. For that reason, Nilsson suggested strengthening this
notion of inference by applying the principle of maximum entropy: rather than considering all
distributions $\mu$ satisfying $\Lambda$, we consider only the distribution(s) $\mu^*$ that have the greatest
entropy among those satisfying the constraints. As we now show, one implication of our
results is that the random-worlds method provides a principled motivation for this
introduction of maximum entropy to probabilistic propositional reasoning. In fact, the connection
between probabilistic propositional reasoning and random worlds should now be fairly clear:

• The primitive propositions $p_1, \ldots, p_k$ correspond to the unary predicates $P_1, \ldots, P_k$.

• A propositional formula $\beta$ over $p_1, \ldots, p_k$ corresponds uniquely to an essentially
propositional formula $\xi_\beta$ as follows: we replace each occurrence of the propositional symbol
$p_i$ with $P_i(x)$.

• The set $\Lambda$ of probabilistic constraints corresponds to a knowledge base $KB'[\Lambda]$, a
constant-free knowledge base containing only proportion expressions. The
correspondence is as follows:

  - A probability expression of the form $\Pr(\beta)$ appearing in $\Lambda$ is replaced by the
    proportion expression $\|\xi_\beta(x)\|_x$. Similarly, a conditional probability expression
    $\Pr(\beta \mid \beta')$ is replaced by $\|\xi_\beta(x) \mid \xi_{\beta'}(x)\|_x$.

  - Each comparison connective $=$ is replaced by $\approx_i$ for some $i$, and each $\le$ with $\preceq_i$.
    (The particular choices for the approximate equality connectives do not matter
    in this context.)

  The other elements that can appear in a proportion formula (such as rational
  numbers and arithmetical connectives) remain unchanged. For example, the formula
  $\Pr(fly \mid bird) \ge 0.7$ would correspond to the proportion formula
  $\|Fly(x) \mid Bird(x)\|_x \succeq_i 0.7$.

• There is a one-to-one correspondence between truth assignments and atoms: the truth
assignment $\omega$ corresponds to the atom $A = P_1' \wedge \ldots \wedge P_k'$ where $P_i'$ is $P_i$ if $\omega(p_i) = true$
and $\neg P_i$ otherwise. Let $\omega_1, \ldots, \omega_K$ be the truth assignments corresponding to the
atoms $A_1, \ldots, A_K$, respectively.

• There is a one-to-one correspondence between probability distributions over the set $\Omega$
of truth assignments and points in $\Delta^K$. For each point $u \in \Delta^K$, let $\mu_u$ denote the
corresponding probability distribution over $\Omega$, where $\mu_u(\omega_i) = u_i$.

Remark 4.13: Clearly, $\omega_j \models \beta$ iff $A_j \in \mathcal{A}(\xi_\beta)$. Therefore, for all $u$, we have

$$F_{[\xi_\beta]}(u) = \Pr_{\mu_u}(\beta).$$ ■

The following result demonstrates the tight connection between probabilistic
propositional reasoning using maximum entropy and random worlds.

Theorem 4.14: Let $\Lambda$ be a conjunction of constraints of the form $\Pr(\beta \mid \beta') = \lambda$ or
$\Pr(\beta \mid \beta') \in [\lambda_1, \lambda_2]$. There is a unique probability distribution $\mu^*$ of maximum entropy
satisfying $\Lambda$. Moreover, for all $\beta$ and $\beta'$, if $\Pr_{\mu^*}(\beta') > 0$, then

$$\Pr_\infty(\xi_\beta(c) \mid \xi_{\beta'}(c) \wedge KB'[\Lambda]) = \Pr_{\mu^*}(\beta \mid \beta').$$

Theorem 4.14 is an easy corollary of Theorem 4.11. To check that the preconditions
of the latter theorem apply, note that the constraints in $\Lambda$ are linear, and so the space
$S^{\vec{0}}[KB'[\Lambda]]$ has a unique maximum-entropy point $v$. In fact, it is easy to show that $\mu_v$ is
the (unique) maximum-entropy probability distribution over $\Omega$ satisfying the constraints
$\Lambda$. In addition, because there are no negated proportion expressions in $\Lambda$, the formula
$KB = \xi_{\beta'}(c) \wedge KB'[\Lambda]$ is certainly essentially positive.

Most applications of probabilistic propositional reasoning consider simple constraints
of the form used in the theorem, and so such applications can be viewed as very special
cases of the random-worlds approach. In fact, this theorem is essentially a very old one.
The connection between counting "worlds" and the entropy maximum in a space defined
as a conjunction of linear constraints is very well-known. It has been extensively studied
in the field of thermodynamics, starting with the 19th century work of Maxwell and Gibbs.
Recently, this type of reasoning has been applied to problems in an AI context by Paris and
Vencovska (1989) and Shastri (1989). The work of Paris and Vencovska is particularly
relevant because they also realize the necessity of adopting a formal notion of "approximation",
although the precise details of their approach differ from ours.

To the best of our knowledge, most of the work on probabilistic propositional
reasoning and all formal presentations of the entropy/worlds connection (in particular, those of
(Paris & Vencovska, 1989; Shastri, 1989)) have limited themselves to conjunctions of
linear constraints. Our more general language gives us a great deal of additional expressive
power. For example, it is quite reasonable to want the ability to express that properties
are (approximately) statistically independent. For example, we may wish to assert that
$Bird(x)$ and $Yellow(x)$ are independent properties by saying $\|Bird(x) \wedge Yellow(x)\|_x \approx
\|Bird(x)\|_x \cdot \|Yellow(x)\|_x$. Clearly, such constraints are not linear. Nevertheless, our
Theorem 4.11 covers such cases and much more.

A version of probabilistic propositional reasoning has also been used to provide proba- 
bilistic semantics for default reasoning (Pearl, 1989). Here also, the connection to random 
worlds is of interest. In particular, it follows from Corollary 4.10 that the recent work of 
Goldszmidt, Morris, and Pearl (1990) can be embedded in the random-worlds framework. 
In the rest of this subsection, we explain their approach and the embedding. 

Consider a language consisting of propositional formulas over the propositional variables
$p_1, \ldots, p_k$, and default rules of the form $B \to C$ (read "$B$'s are typically $C$'s"), where $B$
and $C$ are propositional formulas. A distribution $\mu$ is said to $\varepsilon$-satisfy a default rule $B \to C$
if $\mu(C \mid B) \ge 1 - \varepsilon$. In addition to default rules, the framework also permits the use of
material implication in a rule, as in $B \Rightarrow C$. A distribution $\mu$ is said to satisfy such a rule
if $\mu(C \mid B) = 1$. A parameterized probability distribution (PPD) is a collection $\{\mu_\varepsilon\}_{\varepsilon > 0}$ of
probability distributions over $\Omega$, parameterized by $\varepsilon$. A PPD $\{\mu_\varepsilon\}_{\varepsilon > 0}$ $\varepsilon$-satisfies a set $\mathcal{R}$ of
rules if for every $\varepsilon$, $\mu_\varepsilon$ $\varepsilon$-satisfies every default rule $r \in \mathcal{R}$ and satisfies every non-default
rule $r \in \mathcal{R}$. A set $\mathcal{R}$ of default rules $\varepsilon$-entails $B \to C$ if for every PPD that $\varepsilon$-satisfies $\mathcal{R}$,
$\lim_{\varepsilon \to 0} \mu_\varepsilon(C \mid B) = 1$.
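
To make the definition of $\varepsilon$-satisfaction concrete, here is a small Python sketch. Everything in it is illustrative: distributions are dictionaries from truth assignments to exact fractions, the family `mu_eps` over the two propositions (bird, fly) is a hypothetical PPD invented for this example, and treating the case $\mu(B) = 0$ as a failure is our assumption rather than something stated above.

```python
from fractions import Fraction

def prob(mu, formula):
    """Pr_mu(formula): total weight of the truth assignments satisfying formula."""
    return sum((p for omega, p in mu.items() if formula(omega)), Fraction(0))

def eps_satisfies(mu, b, c, eps):
    """True if mu(C | B) >= 1 - eps for the default rule B -> C.

    Assumption of this sketch: the rule counts as unsatisfied when mu(B) = 0.
    """
    mass_b = prob(mu, b)
    if mass_b == 0:
        return False
    return prob(mu, lambda w: b(w) and c(w)) / mass_b >= 1 - eps

def mu_eps(eps):
    """Hypothetical PPD member over assignments (bird, fly), with mu(fly | bird) = 1 - eps."""
    half, quarter = Fraction(1, 2), Fraction(1, 4)
    return {(True, True): half * (1 - eps), (True, False): half * eps,
            (False, True): quarter, (False, False): quarter}

def bird(w):
    return w[0]

def fly(w):
    return w[1]

# The family eps-satisfies "bird -> fly" for each eps tried here.
for n in (2, 10, 1000):
    eps = Fraction(1, n)
    print(n, eps_satisfies(mu_eps(eps), bird, fly, eps))
```

Exact fractions are used so that the boundary case $\mu(C \mid B) = 1 - \varepsilon$ is not disturbed by floating-point rounding.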

As shown in (Geffner & Pearl, 1990), $\varepsilon$-entailment possesses a number of reasonable
properties typically associated with default reasoning, including a preference for more
specific information. However, there are a number of desirable properties that it does not have.
Among other things, irrelevant information is not ignored. (See (Bacchus et al., 1994) for
an extensive discussion of this issue.)

To obtain additional desirable properties, $\varepsilon$-semantics is extended in (Goldszmidt et al.,
1990) by an application of the principle of maximum entropy. Instead of considering all
possible PPD's, as above, we consider only the PPD $\{\mu^*_{\varepsilon,\mathcal{R}}\}_{\varepsilon > 0}$ such that, for each $\varepsilon$,
$\mu^*_{\varepsilon,\mathcal{R}}$ has the maximum entropy among distributions that $\varepsilon$-satisfy all the rules in $\mathcal{R}$. (See
(Goldszmidt et al., 1990) for precise definitions and technical details.) Note that, since the
constraints used to define $\mu^*_{\varepsilon,\mathcal{R}}$ are all linear, there is indeed a unique such point of maximum
entropy. A rule $B \to C$ is an ME-plausible consequence of $\mathcal{R}$ if $\lim_{\varepsilon \to 0} \mu^*_{\varepsilon,\mathcal{R}}(C \mid B) = 1$.
The notion of ME-plausible consequence is analyzed in detail in (Goldszmidt et al., 1990),
where it is shown to inherit all the nice properties of $\varepsilon$-entailment (such as the preference
for more specific information), while successfully ignoring irrelevant information. Equally
importantly, algorithms are provided for computing the ME-plausible consequences of a set
of rules in certain cases.

Our maximum-entropy results can be used to show that the approach of (Goldszmidt
et al., 1990) can be embedded in our framework in a straightforward manner. We simply
translate a default rule $r$ of the form $B \to C$ into a first-order default rule

$$\theta_r =_{def} \|\xi_C(x) \mid \xi_B(x)\|_x \approx_1 1,$$

as in our earlier translation of Nilsson's approach. Note that the formulas that arise under
this translation all use the same approximate equality connective $\approx_1$. The reason is that
the approach of (Goldszmidt et al., 1990) uses the same $\varepsilon$ for all default rules. We can
similarly translate a (non-default) rule $r$ of the form $B \Rightarrow C$ into a first-order constraint
using universal quantification:

$$\theta_r =_{def} \forall x\, (\xi_B(x) \Rightarrow \xi_C(x)).$$

Under this translation, we can prove the following theorem.

Theorem 4.15: Let $c$ be a constant symbol. Using the translation described above, for a
set $\mathcal{R}$ of defeasible rules, $B \to C$ is an ME-plausible consequence of $\mathcal{R}$ iff

$$\Pr_\infty\Big(\xi_C(c) \ \Big|\ \xi_B(c) \wedge \bigwedge_{r \in \mathcal{R}} \theta_r\Big) = 1.$$

In particular, this theorem implies that all the computational techniques and results
described in (Goldszmidt et al., 1990) carry over to this special case of the random-worlds
method. It also shows that random worlds provides a principled justification for the approach
that (Goldszmidt et al., 1990) present (one which is quite different from the justification given
in (Goldszmidt et al., 1990) itself).

4.4 Beyond simple queries

In Section 4.2 we restricted attention to simple queries. Our main result, Theorem 4.11,
needed other assumptions as well: essential positivity, the existence of a unique
maximum-entropy point $v$, and the requirement that $F_{[\psi]}(v) > 0$. We believe that this theorem is useful
in spite of its limitations, as demonstrated by the discussion in Section 4.3. Nevertheless,
this result allows us to take advantage of only a small fragment of our rich language. Can
we find a more general theorem? After all, the basic concentration result (Theorem 3.13)
holds with essentially no restrictions. In this section we show that it is indeed possible to
extend Theorem 4.11 significantly. However, there are serious limitations and subtleties.
We illustrate these problems by means of examples, and then state an extended result.

Our attempt to address these problems (so far as is possible) leads to a rather
complicated final result. In fact, the problems we discuss are as interesting and important as
the theorem we actually give: they help us understand more of the limits of maximum
entropy. Of course, every issue we discuss in this subsection is relatively minor compared
to maximum entropy's main (apparent) restriction, which concerns the use of non-unary
predicates. For the reader who is less concerned about the other, lesser, issues we remark
that it is possible to skip directly to Section 5.

We first consider the restrictions we placed on the KB, and show the difficulties that
arise if we drop them. We start with the restriction to a single maximum-entropy point. As
the concentration theorem (Theorem 3.13) shows, the entropy of almost every world is near
maximum. But it does not follow that all the maximum-entropy points are surrounded by 
similar numbers of worlds. Thus, in the presence of more than one maximum-entropy point, 
we face the problem of finding the relative importance, or weighting, of each maximum- 
entropy point. As the following example illustrates, this weighting is often sensitive to the 
tolerance values. For this reason, non-unique entropy maxima often lead to nonrobustness. 

Example 4.16: Suppose $\Phi = \{P, c\}$, and consider the knowledge base

$$KB = (\|P(x)\|_x \preceq_1 0.3) \vee (\|P(x)\|_x \succeq_2 0.7).$$

Assume we want to compute $\Pr_\infty(P(c) \mid KB)$. In this case, $S^{\vec{\tau}}[KB]$ is

$$\{(u_1, u_2) \in \Delta^2 : u_1 \le 0.3 + \tau_1 \text{ or } u_1 \ge 0.7 - \tau_2\},$$

and $S^{\vec{0}}[KB]$ is

$$\{(u_1, u_2) \in \Delta^2 : u_1 \le 0.3 \text{ or } u_1 \ge 0.7\}.$$

Note that $S^{\vec{0}}[KB]$ has two maximum-entropy points: (0.3, 0.7) and (0.7, 0.3).

Now consider the maximum-entropy points of $S^{\vec{\tau}}[KB]$ for $\vec{\tau} > \vec{0}$. It is not hard to show
that if $\tau_1 > \tau_2$, then this space has a unique maximum-entropy point, $(0.3 + \tau_1, 0.7 - \tau_1)$.
In this case, $\Pr^{\vec{\tau}}_\infty(P(c) \mid KB) = 0.3 + \tau_1$. On the other hand, if $\tau_1 < \tau_2$, then the unique
maximum-entropy point of this space is $(0.7 - \tau_2, 0.3 + \tau_2)$, in which case
$\Pr^{\vec{\tau}}_\infty(P(c) \mid KB) = 0.7 - \tau_2$. If $\tau_1 = \tau_2$, then the space $S^{\vec{\tau}}[KB]$ has two maximum-entropy points, and by
symmetry we obtain that $\Pr^{\vec{\tau}}_\infty(P(c) \mid KB) = 0.5$. So, by appropriately choosing a sequence
of tolerance vectors converging to $\vec{0}$, we can make the asymptotic value of this fraction
either 0.3, 0.5, or 0.7. Thus $\Pr_\infty(P(c) \mid KB)$ does not exist.

It is not disjunctions per se that cause the problem here: if we consider instead the
database $KB' = (\|P(x)\|_x \preceq_1 0.3) \vee (\|P(x)\|_x \succeq_2 0.6)$, then there is no difficulty. There is
a unique maximum-entropy point of $S^{\vec{0}}[KB']$, namely (0.6, 0.4), and the asymptotic probability
$\Pr_\infty(P(c) \mid KB') = 0.6$, as we would want. 7 ■
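
The dependence on the tolerances in Example 4.16 is easy to see numerically. The sketch below assumes small tolerances (well below 0.2), so that neither interval reaches 0.5; since entropy on $\Delta^2$ increases as $u_1$ moves towards 0.5, only the inner endpoints $0.3 + \tau_1$ and $0.7 - \tau_2$ of the two intervals can be maximum-entropy points. The function name `max_entropy_u1` is ours.

```python
import math

def entropy(u1):
    """Entropy of the point (u1, 1 - u1) of the simplex Delta^2."""
    return -sum(p * math.log(p) for p in (u1, 1.0 - u1) if p > 0)

def max_entropy_u1(tau1, tau2):
    """First coordinates of the maximum-entropy points of
    {u1 <= 0.3 + tau1 or u1 >= 0.7 - tau2}, assuming tau1, tau2 < 0.2."""
    # Only the endpoint of each interval closest to 0.5 can be maximal.
    candidates = [0.3 + tau1, 0.7 - tau2]
    best = max(entropy(c) for c in candidates)
    return [c for c in candidates if math.isclose(entropy(c), best, rel_tol=1e-12)]

# Three ways of letting the tolerances shrink give three different pictures.
for tau1, tau2 in [(0.02, 0.01), (0.01, 0.02), (0.01, 0.01)]:
    print((tau1, tau2), max_entropy_u1(tau1, tau2))
```

When $\tau_1 > \tau_2$ the left interval wins, when $\tau_1 < \tau_2$ the right one does, and only for $\tau_1 = \tau_2$ do both points survive.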

In light of this example (and many similar ones we can construct), we continue to assume 
that there is a single maximum-entropy point. As we argued earlier, we expect this to be 
true in typical practical applications, so the restriction does not seem very serious. 

We now turn our attention to the requirement that $F_{[\psi]}(v) > 0$. As we have already
observed, this seems to be an obvious restriction to make, considering that the function
$F_{[\varphi \mid \psi]}(v)$ is not defined otherwise. However, this difficulty is actually a manifestation of a
much deeper problem. As the following example shows, any approach that just uses the
maximum-entropy point of $S^{\vec{0}}[KB]$ will necessarily fail in some cases where $F_{[\psi]}(v) = 0$.

Example 4.17: Consider the knowledge base

$$KB = (\|Penguin(x)\|_x \approx_1 0) \wedge (\|Fly(x) \mid Penguin(x)\|_x \approx_2 0) \wedge Penguin(Tweety).$$

Suppose we want to compute $\Pr_\infty(Fly(Tweety) \mid KB)$. We can easily conclude
from Theorem 4.1 that this degree of belief is 0, as we would expect. However, we cannot
reach this conclusion using Theorem 4.11 or anything like it. For consider the
maximum-entropy point of $S^{\vec{0}}[KB]$. The coordinates $v_1$, corresponding to $Fly \wedge Penguin$, and $v_2$,
corresponding to $\neg Fly \wedge Penguin$, are both 0. Hence, $F_{[Penguin]}(v) = 0$, so that Theorem 4.11
does not apply.

But, as we said, the problem is more fundamental. The information we need (that the
proportion of flying penguins is zero) is simply not present if all we know is the
maximum-entropy point $v$. We can obtain the same space $S^{\vec{0}}[KB]$ (and thus the same
maximum-entropy point) from quite different knowledge bases. In particular, consider $KB'$ which
simply asserts that $(\|Penguin(x)\|_x \approx_1 0) \wedge Penguin(Tweety)$. This new knowledge base
tells us nothing whatsoever about the fraction of flying penguins, and in fact it is easy to
show that $\Pr_\infty(Fly(Tweety) \mid KB') = 0.5$. But of course it is impossible to distinguish this
case from the previous one just by looking at $v$. It follows that no result in the spirit of
Theorem 4.11 (which just uses the value of $v$) can be comprehensive. ■

7. We remark that it is also possible to construct examples of multiple maximum-entropy points by using
quadratic constraints rather than disjunction.
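
The point of Example 4.17 can be illustrated by looking at positive tolerances instead of the limit point. The Python sketch below is a rough grid search under stated assumptions: a point is parameterized by the Penguin mass $p \le \tau_1$ and the ratio $r$ of flying penguins among penguins, and the non-penguin mass is split evenly between flying and non-flying (which is what maximizing entropy does, since nothing constrains it). The bound `r_max` is 1 when nothing is said about $\|Fly(x) \mid Penguin(x)\|_x$ and $\tau_2$ when it is constrained to be approximately 0. All names are ours.

```python
import math

def entropy(u):
    """Entropy of a probability vector, with 0 log 0 taken to be 0."""
    return -sum(p * math.log(p) for p in u if p > 0)

def fly_ratio_at_max_entropy(tau1, r_max, steps=200):
    """Grid-search the maximum-entropy point with Penguin mass p <= tau1 and
    flying ratio r <= r_max; return the ratio r at that point."""
    best_entropy, best_ratio = None, None
    for i in range(1, steps + 1):
        p = tau1 * i / steps
        for j in range(steps + 1):
            r = r_max * j / steps
            # Atoms: Fly&Penguin, notFly&Penguin, Fly&notPenguin, notFly&notPenguin.
            u = (p * r, p * (1 - r), (1 - p) / 2, (1 - p) / 2)
            e = entropy(u)
            if best_entropy is None or e > best_entropy:
                best_entropy, best_ratio = e, r
    return best_ratio

# Same limit space as tau -> 0, but very different ratios at positive tolerance.
print(fly_ratio_at_max_entropy(0.01, 1.0))    # no constraint on flying penguins
print(fly_ratio_at_max_entropy(0.01, 0.02))   # ratio constrained to be at most tau2 = 0.02
```

In the first case the ratio stays at one half however small the tolerance is; in the second it is pinned to the tolerance and so shrinks to 0, even though both spaces collapse to the same point in the limit.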

The example shows that the philosophy behind Theorem 4.11 cannot be extended very
far, if at all: it is inevitable that there will be problems when $F_{[\psi]}(v) = 0$. But it is natural to
ask whether there is a different approach altogether in which this restriction can be relaxed.
That is, is it possible to construct a technique for computing degrees of belief in those cases
where $F_{[\psi]}(v) = 0$? As we mentioned in Section 4.1, we might hope to do this by computing
$\Pr^{\vec{\tau}}_\infty(\varphi \mid KB)$ as a function of $\vec{\tau}$ and then taking the limit as $\vec{\tau}$ goes to $\vec{0}$. In general, this
seems very hard. But, interestingly, the computational technique of (Goldszmidt et al.,
1990) does use this type of parametric analysis, demonstrating that things might not be
so bad for various restricted cases. Another source of hope is to remember that maximum
entropy is, for us, merely one tool for computing random-worlds degrees of belief. There
may be other approaches that bypass entropy entirely. In particular, some of the theorems
we give in (Bacchus et al., 1994) can be seen as doing this; these theorems will often apply
even if $F_{[\psi]}(v) = 0$.

Another assumption made throughout Section 4.2 is that the knowledge base has a
special form, namely $\psi(c) \wedge KB'$, where $\psi$ is essentially propositional and $KB'$ does not contain
any occurrences of $c$. The more general theorem we state later relaxes this somewhat, as
follows.

Definition 4.18: A knowledge base KB is said to be separable with respect to query $\varphi$ if
it has the form $\psi \wedge KB'$, where $\psi$ contains neither quantifiers nor proportions, and $KB'$
contains none of the constant symbols appearing in $\varphi$ or in $\psi$. 8 ■

It should be clear that if a query $\varphi(c)$ is simple for KB (as assumed in Section 4.2),
then the separability condition is satisfied.

As the following example shows, if we do not assume separability, we can easily run into
nonrobust behavior:

Example 4.19: Consider the following knowledge base KB over the vocabulary $\Phi = \{P, c\}$:

$$(\|P(x)\|_x \approx_1 0.3 \wedge P(c)) \vee (\|P(x)\|_x \approx_2 0.3 \wedge \neg P(c)).$$

KB is not separable with respect to the query $P(c)$. The space $S^{\vec{0}}[KB]$ consists of a
unique point (0.3, 0.7), which is also the maximum-entropy point. Both disjuncts of KB
are consistent with the maximum-entropy point, so we might expect that the presence
of the conjuncts $P(c)$ and $\neg P(c)$ in the disjuncts would not affect the degree of belief.
That is, if it were possible to ignore or discount the role of the tolerances, we would
expect $\Pr_\infty(P(c) \mid KB) = 0.3$. However, this is not the case. Consider the behavior of
$\Pr^{\vec{\tau}}_\infty(P(c) \mid KB)$ for $\vec{\tau} > \vec{0}$. If $\tau_1 > \tau_2$, then the maximum-entropy point of $S^{\vec{\tau}}[KB]$ is
$(0.3 + \tau_1, 0.7 - \tau_1)$. Now, consider some $\epsilon > 0$ sufficiently small so that $\tau_2 + \epsilon < \tau_1$. By
Corollary 3.14, we deduce that $\Pr^{\vec{\tau}}_\infty((\|P(x)\|_x > 0.3 + \tau_2) \mid KB) = 1$. Therefore, by
Theorem 3.16, $\Pr^{\vec{\tau}}_\infty(P(c) \mid KB) = \Pr^{\vec{\tau}}_\infty(P(c) \mid KB \wedge (\|P(x)\|_x > 0.3 + \tau_2))$ (assuming the limit
exists). But since the newly added expression is inconsistent with the second disjunct, we
obtain that $\Pr^{\vec{\tau}}_\infty(P(c) \mid KB) = \Pr^{\vec{\tau}}_\infty(P(c) \mid P(c) \wedge (\|P(x)\|_x \approx_1 0.3)) = 1$, and not 0.3. On
the other hand, if $\tau_1 < \tau_2$, we get the symmetric behavior, where $\Pr^{\vec{\tau}}_\infty(P(c) \mid KB) = 0$. Only
if $\tau_1 = \tau_2$ do we get the expected value of 0.3 for $\Pr^{\vec{\tau}}_\infty(P(c) \mid KB)$. Clearly, by appropriately
choosing a sequence of tolerance vectors converging to $\vec{0}$, we can make the asymptotic value
of this fraction any of 0, 0.3, or 1, or not exist at all. Again, $\Pr_\infty(P(c) \mid KB)$ is not robust. ■

8. Clearly, since our approach is semantic, it also suffices if the knowledge base is equivalent to one of this
form.
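
For a vocabulary as small as $\{P, c\}$, the fraction of worlds in Example 4.19 can be counted exactly for finite $N$. The Python sketch below uses the fact that a world of size $N$ is fixed by the extension of $P$ (there are $\binom{N}{m}$ extensions of size $m$) and the denotation of $c$; the two disjuncts are mutually exclusive, so their world counts simply add. The function name, the choice $N = 2000$, and the tolerance values are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def pr_P_of_c(N, tau1, tau2, target=Fraction(3, 10)):
    """Exact fraction of size-N worlds satisfying KB (Example 4.19) in which P(c) holds."""
    first = second = 0
    for m in range(N + 1):                # m = number of individuals satisfying P
        gap = abs(Fraction(m, N) - target)
        extensions = comb(N, m)           # ways to choose the extension of P
        if gap <= tau1:
            first += extensions * m       # first disjunct: c is one of the m elements of P
        if gap <= tau2:
            second += extensions * (N - m)  # second disjunct: c is one of the others
    return Fraction(first, first + second)

N = 2000
print(float(pr_P_of_c(N, Fraction(1, 50), Fraction(1, 100))))  # tau1 > tau2: close to 1
print(float(pr_P_of_c(N, Fraction(1, 100), Fraction(1, 50))))  # tau1 < tau2: close to 0
print(float(pr_P_of_c(N, Fraction(1, 50), Fraction(1, 50))))   # tau1 = tau2: close to 0.3 + tau1
```

Already at this moderate domain size, the ordering of the two tolerances decides whether the fraction is near 1, near 0, or near 0.3.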

We now turn our attention to restrictions on the query. In Section 4.2, we restricted 
to queries of the form (f(c), where (f(x) is essentially propositional. Although we intend to 
ease this restriction, we do not intend to allow queries that involve statistical information. 
The following example illustrates the difficulties. 

Example 4.20: Consider the knowledge base $KB = \|P(x)\|_x \approx_1 0.3$ and the query $\varphi =
\|P(x)\|_x \approx_2 0.3$. It is easy to see that the unique maximum-entropy point of $S^{\vec{\tau}}[KB]$ is
$(0.3 + \tau_1, 0.7 - \tau_1)$. First suppose $\tau_2 < \tau_1$. From Corollary 3.14, it follows that
$\Pr^{\vec{\tau}}_\infty((\|P(x)\|_x > 0.3 + \tau_2) \mid KB) = 1$. Therefore, by Theorem 3.16,
$\Pr^{\vec{\tau}}_\infty(\varphi \mid KB) = \Pr^{\vec{\tau}}_\infty(\varphi \mid KB \wedge (\|P(x)\|_x > 0.3 + \tau_2))$ (assuming the limit exists).
The latter expression is clearly 0. On the other hand, if $\tau_1 < \tau_2$, then $KB[\vec{\tau}] \models \varphi[\vec{\tau}]$, so that
$\Pr^{\vec{\tau}}_\infty(\varphi \mid KB) = 1$. Thus, the limiting behavior of $\Pr^{\vec{\tau}}_\infty(\varphi \mid KB)$ depends on how $\vec{\tau}$ goes
to $\vec{0}$, so that $\Pr_\infty(\varphi \mid KB)$ is nonrobust. |
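The dependence on the order of the tolerances in Example 4.20 can already be seen in finite domains. The following is a minimal sketch, not part of the formal development. It assumes that the vocabulary consists of the single unary predicate P (so that a world of size N is just a subset of the domain), and that $\|P(x)\|_x \approx_i 0.3$ is read as $|k/N - 0.3| \le \tau_i$, where k is the number of elements satisfying P. The function name `finite_degree_of_belief` is hypothetical.

```python
from fractions import Fraction
from math import comb

def finite_degree_of_belief(N, tau1, tau2, p=Fraction(3, 10)):
    """Exact fraction of size-N worlds satisfying KB that also satisfy the query.

    KB    : | ||P(x)||_x - p | <= tau1
    query : | ||P(x)||_x - p | <= tau2
    Assumption: the vocabulary is just {P}, so there are comb(N, k) worlds
    in which exactly k domain elements satisfy P.
    """
    kb_worlds = 0
    kb_and_query_worlds = 0
    for k in range(N + 1):
        deviation = abs(Fraction(k, N) - p)
        if deviation <= tau1:
            worlds_with_k = comb(N, k)
            kb_worlds += worlds_with_k
            if deviation <= tau2:
                kb_and_query_worlds += worlds_with_k
    return Fraction(kb_and_query_worlds, kb_worlds)

# tau2 < tau1: almost all KB-worlds sit near proportion 0.3 + tau1, outside the query.
small = finite_degree_of_belief(400, Fraction(1, 10), Fraction(1, 20))
# tau1 < tau2: every world satisfying KB satisfies the query.
one = finite_degree_of_belief(400, Fraction(1, 20), Fraction(1, 10))
print(float(small), float(one))
```

With $\tau_2 < \tau_1$ the computed fraction is close to 0 and shrinks further as N grows, whereas with $\tau_1 < \tau_2$ it is exactly 1, in line with the two cases of the example.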

The real problem here is the semantics of proportion expressions in queries. While the 
utility of the $\approx$ connective in expressing statistical information in the knowledge base should
be fairly uncontroversial, its role in conclusions we might draw, such as $\varphi$ in Example 4.20, is
much less clear. The formal semantics we have defined requires that we consider all possible
tolerances for a proportion expression in $\varphi$, so it is not surprising that nonrobustness is the
usual result. One might argue that the tolerances in queries should be allowed to depend 
more closely on tolerances of expressions in the knowledge base. It is possible to formalize 
this intuition, as is done in (Koller & Halpern, 1992), to give an alternative semantics for
dealing with proportion expressions in queries that often gives more reasonable behavior. 
Considerations of this alternative semantics would lead us too far afield here; rather, we 
focus for the rest of the section on first-order queries. 

In fact, our goal is to allow arbitrary first-order queries, even those that involve predicates
of arbitrary arity and equality (although we still need to restrict the knowledge base
to the unary language $\mathcal{L}_1^{\approx}$). However, as the following example shows, quantifiers too can
cause problems. 

Example 4.21: Let $\Phi = \{P, c\}$ and consider $KB_1 = \forall x\, \neg P(x)$, $KB_2 = \|P(x)\|_x \approx_1 0$, and
$\varphi = \exists x\, P(x)$. It is easy to see that $S^{\vec{0}}[KB_1] = S^{\vec{0}}[KB_2] = \{(0, 1)\}$, and therefore the unique
maximum-entropy point in both is $v = (0, 1)$. However, $\Pr_\infty(\varphi \mid KB_1)$ is clearly 0, whereas
$\Pr_\infty(\varphi \mid KB_2)$ is actually 1. To see the latter fact, observe that the vast majority of models
of $KB_2$ around $v$ actually satisfy $\exists x\, P(x)$. There is actually only a single world associated
with $(0, 1)$ at which $\exists x\, P(x)$ is false. This example is related to Example 4.17, because it
illustrates another case in which $S^{\vec{0}}[KB]$ cannot suffice to determine degrees of belief. |
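The value 1 for $KB_2$ can be checked by a direct count. The following is a sketch under two assumptions: $\|P(x)\|_x \approx_1 0$ is read as $\|P(x)\|_x \le \tau_1$ for a fixed $\tau_1 > 0$, and worlds of size N are counted by the number k of elements satisfying P. The denotation of c contributes the same factor N to numerator and denominator and therefore cancels.

```latex
\[
\frac{\#\{\text{worlds of size } N \text{ satisfying } KB_2 \wedge \exists x\, P(x)\}}
     {\#\{\text{worlds of size } N \text{ satisfying } KB_2\}}
  = \frac{N \sum_{k=1}^{\lfloor \tau_1 N \rfloor} \binom{N}{k}}
         {N \sum_{k=0}^{\lfloor \tau_1 N \rfloor} \binom{N}{k}}
  = 1 - \frac{1}{\sum_{k=0}^{\lfloor \tau_1 N \rfloor} \binom{N}{k}}
  \;\longrightarrow\; 1 \quad \text{as } N \to \infty .
\]
```

For $KB_1$, by contrast, only the term $k = 0$ is allowed, so the corresponding fraction is 0 for every N.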

In the case of the knowledge base $KB_2$, the maximum-entropy point $(0, 1)$ is quite
misleading about the nature of nearby worlds. We must avoid this sort of "discontinuity" 
when finding the degree of belief of a formula that involves first-order quantifiers. The 
notion of stability defined below is intended to deal with this problem. To define it, we first 
need the following notion of a size description. 

Definition 4.22: A size description (over $\mathcal{P}$) is a conjunction of $K$ formulas: for each
atom $A_j$ over $\mathcal{P}$, it includes exactly one of $\exists x\, A_j(x)$ and $\neg\exists x\, A_j(x)$. For $u \in \Delta^K$, the size
description associated with $u$, written $\sigma(u)$, is that size description which includes $\neg\exists x\, A_i(x)$
if $u_i = 0$ and $\exists x\, A_i(x)$ if $u_i > 0$. |
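The map from a point u to its size description is easy to state operationally. The following is a minimal sketch, assuming that a point is given as a sequence of K proportions and that a size description is represented as a tuple of booleans (entry i is true exactly when the description contains $\exists x\, A_i(x)$). The names `size_description` and `render` are hypothetical.

```python
def size_description(u):
    """sigma(u): entry i is True iff u_i > 0, i.e. iff 'exists x A_i(x)' is included."""
    return tuple(u_i > 0 for u_i in u)

def render(sigma):
    """Spell out a size description as a readable conjunction (atoms are 1-indexed)."""
    conjuncts = []
    for i, nonempty in enumerate(sigma, start=1):
        prefix = "" if nonempty else "not "
        conjuncts.append(f"{prefix}exists x A_{i}(x)")
    return " and ".join(conjuncts)

print(render(size_description((0, 1))))        # the point (0, 1)
print(render(size_description((0.05, 0.95))))  # a nearby point with u_1 > 0
```

The two sample points differ only slightly in their coordinates but receive different size descriptions, which is the kind of discontinuity the notion of stability is meant to control.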

The problems that we want to avoid occur when there is a maximum-entropy point v 
with size description $\sigma(v)$ such that in a neighborhood of $v$, most of the worlds satisfying
KB are associated with other size descriptions. Intuitively, the problem with this is that the 
coordinates of v alone give us misleading information about the nature of worlds near v, and 
so about degrees of belief. 9 We give a sufficient condition which can be used to avoid this 
problem in the context of our theorems. This condition is effective and uses machinery (in 
particular, the ability to find solution spaces) that is needed to use the maximum-entropy 
approach in any case. 

Definition 4.23: Let $v$ be a maximum-entropy point of $S^{\vec{\tau}}[KB]$. We say that $v$ is safe
(with respect to $KB$ and $\vec{\tau}$) if $v$ is not contained in $S^{\vec{\tau}}[KB \wedge \neg\sigma(v)]$. We say that $KB$ and
$\vec{\tau}$ are stable for $\sigma^*$ if for every maximum-entropy point $v \in S^{\vec{\tau}}[KB]$ we have that $\sigma(v) = \sigma^*$
and that $v$ is safe with respect to $KB$ and $\vec{\tau}$. |

The next result is the key property of stability that we need. 

Theorem 4.24: If $KB$ and $\vec{\tau} > \vec{0}$ are stable for $\sigma^*$ then $\Pr^{\vec{\tau}}_\infty(\sigma^* \mid KB) = 1$.

Our theorems will use the assumption that there exists some $\sigma^*$ such that, for all
sufficiently small $\vec{\tau}$, $KB$ and $\vec{\tau}$ are stable for $\sigma^*$. We note that this does not imply that $\sigma^*$ is
necessarily the size description associated with the maximum-entropy point(s) of $S^{\vec{0}}[KB]$.

Example 4.25: Consider the knowledge base $KB_2$ in Example 4.21, and recall that $v =
(0, 1)$ is the maximum-entropy point of $S^{\vec{0}}[KB_2]$. The size description $\sigma(v)$ is $\neg\exists x\, A_1(x) \wedge
\exists x\, A_2(x)$. However, the maximum-entropy point of $S^{\vec{\tau}}[KB_2]$ for $\vec{\tau} > \vec{0}$ is actually $(\tau_1, 1 - \tau_1)$,
so that the appropriate $\sigma^*$ for such a $\vec{\tau}$ is $\exists x\, A_1(x) \wedge \exists x\, A_2(x)$. |

9. We actually conjecture that problems of this sort cannot arise in the context of a maximum-entropy point
of $S^{\vec{\tau}}[KB]$ for $\vec{\tau} > \vec{0}$. More precisely, for sufficiently small $\vec{\tau}$ and a maximum-entropy point $v$ of $S^{\vec{\tau}}[KB]$
with $KB \in \mathcal{L}_1^{\approx}$, we conjecture that $\Pr^{\vec{\tau}}_\infty[O](\sigma(v) \mid KB) = 1$ where $O$ is an open set that contains $v$ but
no other maximum-entropy point of $S^{\vec{\tau}}[KB]$. If this is indeed the case, then the machinery of stability
that we are about to introduce is unnecessary, since it holds in all cases that we need it. However, we
have been unable to prove this.
As we now show, the restrictions outlined above and in Section 4.1 suffice for our next
result on computing degrees of belief. In order to state this result, we need one additional
concept. Recall that in Section 4.2 we expressed an essentially propositional formula $\varphi(x)$
as a disjunction of atoms. Since we wish to also consider formulas $\varphi$ using more than
one constant and non-unary predicates, we need a richer concept than atoms. This is the
motivation behind the definition of complete descriptions.

Definition 4.26: Let $Z$ be some set of variables and constants. A complete description $D$
over $\Phi$ and $Z$ is an unquantified conjunction of formulas such that:

• For every predicate $R \in \Phi \cup \{=\}$ of arity $r$ and for every $z_{i_1}, \ldots, z_{i_r} \in Z$, $D$ contains
exactly one of $R(z_{i_1}, \ldots, z_{i_r})$ or $\neg R(z_{i_1}, \ldots, z_{i_r})$ as a conjunct.

• D is consistent. 10 | 

Complete descriptions simply extend the role of atoms in the context of essentially
propositional formulas to the more general setting. As in the case of atoms, if we fix some arbitrary
ordering of the conjuncts in a complete description, then complete descriptions are mutually
exclusive and exhaustive. Clearly, a formula $\xi$ whose free variables and constants are
contained in $Z$, and which is quantifier- and proportion-free, is equivalent to some
disjunction of complete descriptions over $Z$. For such a formula $\xi$, let $\mathcal{A}(\xi)$ be a set of complete
descriptions over $Z$ such that $\xi$ is equivalent to the disjunction $\bigvee_{D \in \mathcal{A}(\xi)} D$, where $Z$ is the
set of constants and free variables in $\xi$.

For the purposes of the remaining discussion (except within proofs), we are interested 
only in complete descriptions over an empty set of variables. For a set of constants Z, we 
can view a description D over Z as describing the different properties of the constants in Z. 
In our construction, when considering a $KB$ of the form $\psi \wedge KB'$ which is separable with
respect to a query $\varphi$, we define the set $Z$ to contain precisely those constants in $\varphi$ and in
$\psi$. In particular, this means that $KB'$ will mention no constant in $Z$.

A complete description $D$ over a set of constants $Z$ can be decomposed into three parts:
the unary part $D^1$ which consists of those conjuncts of $D$ that involve unary predicates
(and thus determines an atom for each of the constant symbols), the equality part $D^{=}$
which consists of those conjuncts of $D$ involving equality (and thus determines which of
the constants are equal to each other), and the non-unary part $D^{>1}$ which consists of
those conjuncts of $D$ involving non-unary predicates (and thus determines the non-unary
properties other than equality of the constants). As we suggested, the unary part of such
a complete description $D$ extends the notion of "atom" to the case of multiple constants.
For this purpose, we also extend $F_{[A]}$ (for an atom $A$) and define $F_{[D]}$ for a description $D$.
Intuitively, we are treating each of the individuals as independent, so that the probability
that constant $c_1$ satisfies atom $A_{j_1}$ and that constant $c_2$ satisfies $A_{j_2}$ is just the product of
the probability that $c_1$ satisfies $A_{j_1}$ and the probability that $c_2$ satisfies $A_{j_2}$.

10. Inconsistency is possible because of the use of equality. For example, if $D$ includes $z_1 = z_2$ as well as
both $R(z_1, z_3)$ and $\neg R(z_2, z_3)$, it is inconsistent.

Definition 4.27: For a complete description $D$ without variables whose unary part is
equivalent to $A_{j_1}(c_1) \wedge \ldots \wedge A_{j_m}(c_m)$ (for distinct constants $c_1, \ldots, c_m$) and for a point
$u \in \Delta^K$, we define

\[
F_{[D]}(u) = \prod_{\ell=1}^{m} u_{j_\ell} .
\]

Note that $F_{[D]}$ depends only on $D^1$, the unary part of $D$.
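As a small worked instance of this product, consider an illustrative setting that is not taken from the text: $K = 2$ atoms, two distinct constants $c_1, c_2$, and the point $u = (0.3, 0.7)$.

```latex
\[
D^1 = A_1(c_1) \wedge A_2(c_2): \quad F_{[D]}(u) = u_1 u_2 = 0.3 \cdot 0.7 = 0.21,
\qquad
D^1 = A_2(c_1) \wedge A_2(c_2): \quad F_{[D]}(u) = u_2^2 = 0.49 .
\]
```

Summing the product over all four possible unary parts gives $(u_1 + u_2)^2 = 1$, as one would expect when the constants are treated as independent.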

As we mentioned, we can extend our approach to deal with formulas $\varphi$ that also use
non-unary predicate symbols. Our computational procedure for such formulas uses the
maximum-entropy approach described above combined with the techniques of (Grove et al.,
1993b). These latter were used in (Grove et al., 1993b) to compute asymptotic conditional
probabilities when conditioning on a first-order knowledge base $KB_{fo}$. The basic idea in
that case is as follows: To compute $\Pr_\infty(\varphi \mid KB_{fo})$, we examine the behavior of $\varphi$ in finite
models of $KB_{fo}$. We partition the models of $KB_{fo}$ into a finite collection of classes such that
$\varphi$ behaves uniformly in each individual class. By this we mean that almost all worlds in the
class satisfy $\varphi$ or almost none do; i.e., there is a 0-1 law for the asymptotic probability of $\varphi$
when we restrict attention to models in a single class. In order to compute $\Pr_\infty(\varphi \mid KB_{fo})$ we
therefore identify the classes, compute the relative weight of each class (which is required
because the classes are not necessarily of equal relative size), and then decide for each class
whether the asymptotic probability of $\varphi$ is zero or one.

It turns out that much the same ideas continue to work in this framework. In this case, 
the classes are defined using complete descriptions and the appropriate size description $\sigma^*$.
The main difference is that, rather than examining all worlds consistent with the knowledge 
base, we now concentrate on those worlds in the vicinity of the maximum-entropy points, as 
outlined in the previous section. It turns out that the restriction to these worlds affects very 
few aspects of this computational procedure. In fact, the only difference is in computing the 
relative weight of the different classes. This last step can be done using maximum entropy, 
using the tools described in Section 4.2. 

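Theorem 4.28: Let $\varphi$ be a first-order formula and let $KB = \psi \wedge KB'$ be an essentially positive
knowledge base in $\mathcal{L}_1^{\approx}$ which is separable with respect to $\varphi$. Let $Z$ be the set of constants
appearing in $\varphi$ or in $\psi$ (so that $KB'$ contains none of the constants in $Z$) and let $\chi^{\neq}$ be
the formula $\bigwedge c \neq c'$, where the conjunction ranges over pairs of distinct constant symbols
$c, c' \in Z$. Assume that there exists a size description $\sigma^*$ such that, for
all $\vec{\tau} > \vec{0}$, $KB$ and $\vec{\tau}$ are stable for $\sigma^*$, and that the space $S^{\vec{0}}[KB]$ has a unique
maximum-entropy point $v$. Then

\[
\Pr\nolimits_\infty(\varphi \mid KB) =
  \frac{\sum_{D \in \mathcal{A}(\psi \wedge \chi^{\neq})} \Pr_\infty(\varphi \mid \sigma^* \wedge D)\, F_{[D]}(v)}
       {\sum_{D \in \mathcal{A}(\psi \wedge \chi^{\neq})} F_{[D]}(v)}
\]

if the denominator is positive.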

Since both $\varphi$ and $\sigma^* \wedge D$ are first-order formulas and $\sigma^* \wedge D$ is precisely of the required form
in (Grove et al., 1993b), $\Pr_\infty(\varphi \mid \sigma^* \wedge D)$ is either 0 or 1, and we can use the algorithm
of (Grove et al., 1993b) to compute this limit, in the time bounds outlined there.
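Once the 0-1 values $\Pr_\infty(\varphi \mid \sigma^* \wedge D)$ are known, the expression in Theorem 4.28 is a weighted average. The following is a minimal sketch under stated assumptions: each complete description is represented only by the 1-based atom indices assigned to the distinct constants in Z, its 0-1 value is supplied by the caller (the algorithm of Grove et al., 1993b, is not implemented here), and the names `weight` and `belief_from_descriptions` are hypothetical.

```python
from fractions import Fraction
from math import prod

def weight(atom_indices, v):
    """F_[D](v): product of v_j over the atoms (1-based) assigned to the constants."""
    return prod(v[j - 1] for j in atom_indices)

def belief_from_descriptions(descriptions, v):
    """Weighted average from Theorem 4.28.

    descriptions: list of (atom_indices, zero_one) pairs, one per complete
    description D in A(psi and chi); zero_one stands for Pr(phi | sigma* and D),
    which is assumed to be 0 or 1 and to be computed elsewhere.
    """
    weights = [weight(atoms, v) for atoms, _ in descriptions]
    denominator = sum(weights)
    if denominator <= 0:
        raise ValueError("the theorem applies only when the denominator is positive")
    numerator = sum(w * zero_one for w, (_, zero_one) in zip(weights, descriptions))
    return numerator / denominator

# Illustrative instance: one constant c, atoms A_1 = P and A_2 = not P,
# maximum-entropy point v = (0.3, 0.7), query P(c).
v = (Fraction(3, 10), Fraction(7, 10))
print(belief_from_descriptions([((1,), 1), ((2,), 0)], v))
```

In this illustrative instance the descriptions $A_1(c)$ and $A_2(c)$ have weights 0.3 and 0.7, and only the first entails the query, so the sketch returns 3/10.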

One corollary of the above is that the formula $\chi^{\neq}$ holds with probability 1 given any
knowledge base KB of the form we are interested in. This corresponds to a default assump- 
tion of unique names, a property often considered to be desirable in inductive reasoning 
systems. 
While this theorem does represent a significant generalization of Theorem 4.11, it still
has numerous restrictions. There is no question that some of these can be loosened to some 
extent, although we have not been able to find a clean set of conditions significantly more 
general than the ones that we have stated. We leave it as an open problem whether such a 
set of conditions exists. Of course, the most significant restriction we have made is that of 
allowing only unary predicates in the KB. This issue is the subject of the next section. 

5. Beyond unary predicates 

The random-worlds method makes complete sense for the full language $\mathcal{L}^{\approx}$ (and, indeed, for
even richer languages). On the other hand, our application of maximum entropy is limited 
to unary knowledge bases. Is this restriction essential? While we do not have a theorem to 
this effect (indeed, it is not even clear what the wording of such a theorem would be), we 
conjecture that it is. 

Certainly none of the techniques we have used in this paper can be generalized signif- 
icantly. One difficulty is that, once we have a binary or higher arity predicate, we see no 
analogue to the notion of atoms and no canonical form theorem. In Section 3.2 and in the 
proof of Theorem 3.5, we discuss why it becomes impossible to get rid of nested quantifiers 
and proportions when we have non-unary predicates. Even considering matters on a more 
intuitive level, the problems seem formidable. In a unary language, atoms are useful be- 
cause they are simple descriptions that summarize everything that might be known about a 
domain element in a model. But consider a language with a single binary predicate R(x,y). 
Worlds over this language include all finite graphs (where we think of R(x,y) as holding if 
there is an edge from x to y). In this language, there are infinitely many properties that 
may be true or false about a domain element. For example, the assertions "the node x has 
m neighbors" are expressible in the language for each m. Thus, in order to partition the 
domain elements according to the properties they satisfy, we would need to define infinitely 
many partitions. Furthermore, it can be shown that "typically" (i.e., in almost all graphs 
of sufficiently great size) each node satisfies a different set of first-order properties. Thus, 
in most graphs, all the nodes are "different", so a partition of domain elements into a finite 
number of "atoms" makes little sense. It is very hard to see how the basic proof strat- 
egy we have used, of summarizing a model by listing the number of elements with various 
properties, can possibly be useful here. 

The difficulty of finding an analogue to entropy in the presence of higher-arity predicates
is supported by results from (Grove et al., 1993a). In this paper we have shown that 
maximum entropy can be a useful tool for computing degrees of belief in certain cases, if 
the KB involves only unary predicates. In (Grove et al., 1993a) we show that there can be 
no general computational technique to compute degrees of belief once we have non-unary 
predicate symbols in the KB. The problem of finding degrees of belief in this case is highly 
undecidable. This result was proven without statistical assertions in the language, and in 
fact holds for quite weak sublanguages of first-order logic. (For instance, in a language 
without equality and with only depth-two quantifier nesting.) So even if there is some 
generalized version of maximum entropy, it will either be extremely restricted in application 
or will be useless as a computational tool. 
6. Conclusion

This paper has had two major thrusts. The first is to establish a connection between maximum
entropy and the random-worlds approach for a significant fragment of our language,
one far richer than that considered by Paris and Vencovska (1989) or Shastri (1989). The 
second is to suggest that such a result is unlikely to obtain for the full language. 

The fact that we have a connection between maximum entropy and random worlds is 
significant. For one thing, it allows us to utilize all the tools that have been developed for 
computing maximum entropy efficiently (see (Goldman, 1987) and the further references 
therein), and may thus lead to efficient algorithms for computing degrees of belief for a large 
class of knowledge bases. In addition, maximum entropy is known to have many attractive 
properties (Jaynes, 1978). Our result shows these properties are shared by the random- 
worlds approach in the domain where these two approaches agree. Indeed, as shown in 
(Bacchus et al., 1994), the random-worlds approach has many of these properties for the
full (non-unary) language. 

On the other hand, a number of properties of maximum entropy, such as its dependence 
on the choice of language and its inability to handle causal reasoning appropriately, have 
been severely criticized (Pearl, 1988; Goldszmidt et al., 1990). Not surprisingly, these 
criticisms apply to random worlds as well. A discussion of these criticisms, and whether 
they really should be viewed as shortcomings of the random-worlds method, is beyond the
scope of this paper; the interested reader should consult (Bacchus et al., 1994, Section 7) 
for a more thorough discussion of these issues and additional references. 

We believe that our observations regarding the limits of the connection between the
random-worlds method and maximum entropy are also significant. The question of how
widely maximum entropy applies is quite important. Maximum entropy has been gaining
prominence as a means of dealing with uncertainty both in AI and other areas. However,
the difficulties of using the method once we move to non-unary predicates seem not to
have been fully appreciated. In retrospect, this is not that hard to explain; in almost all
applications where maximum entropy has been used (and where its application can be best
justified in terms of the random-worlds method) the knowledge base is described in terms
of unary predicates (or, equivalently, unary functions with a finite range). For example, in
physics applications we are interested in such predicates as quantum state (see (Denbigh
& Denbigh, 1985)). Similarly, AI applications and expert systems typically use only unary
predicates such as symptoms and diseases (Cheeseman, 1983). We suspect that this is not an
accident, and that deep problems will arise in more general cases. This poses a challenge to
proponents of maximum entropy since, even if one accepts the maximum-entropy principle,
the discussion above suggests that it may simply be inapplicable in a large class of interesting
examples.

Appendix A. Proofs for Section 3.2 

Theorem 3.5: Every formula in $\mathcal{L}_1^{=}$ is equivalent to a formula in canonical form.
Moreover, there is an effective procedure that, given a formula $\xi \in \mathcal{L}_1^{=}$, constructs an equivalent
formula $\widehat{\xi}$ in canonical form.
Proof: We show how to effectively transform $\xi \in \mathcal{L}_1^{=}$ to an equivalent formula in canonical
form. We first rename variables if necessary, so that all variables used in $\xi$ are distinct
(i.e., no two quantifiers, including proportion expressions, ever bind the same variable
symbol).

We next transform $\xi$ into an equivalent flat formula $\xi_f \in \mathcal{L}_1^{=}$, where a flat formula
is one where no quantifiers (including proportion quantifiers) have within their scope a
constant or variable other than the variable(s) the quantifier itself binds. (Note that in this
transformation we do not require that $\xi$ be closed. Also, observe that flatness implies that
there are no nested quantifiers.)

We define the transformation by induction on the structure of $\xi$. There are three easy
steps:

• If $\xi$ is an unquantified formula, then $\xi_f = \xi$.

• $(\xi' \vee \xi'')_f = \xi'_f \vee \xi''_f$.

• $(\neg \xi')_f = \neg(\xi'_f)$.

All that remains is to consider quantified formulas of the form $\exists x\, \xi'$, $\|\xi'\|_{\vec{x}}$, or
$\|\xi' \mid \xi''\|_{\vec{x}}$. It turns out that the same transformation works in all three cases. We illustrate
the transformation by looking at the case where $\xi$ is of the form $\|\xi'\|_{\vec{x}}$. By the inductive
hypothesis, we can assume that $\xi'$ is flat. For the purposes of this proof, we define a basic
formula to be an atomic formula (i.e., one of the form $P(z)$), a proportion formula, or a
quantified formula (i.e., one of the form $\exists x\, \chi$). Let $\chi_1, \ldots, \chi_\ell$ be all basic subformulas of
$\xi'$ that do not mention any variable in $\vec{x}$. Let $z$ be a variable or constant symbol not in
$\vec{x}$ that is mentioned in $\xi'$. Clearly $z$ must occur in some basic subformula of $\xi'$, say $\chi'$.
By the inductive hypothesis, it is easy to see that $\chi'$ cannot mention any variable in $\vec{x}$ and
so, by construction, it is in $\{\chi_1, \ldots, \chi_\ell\}$. In other words, not only do $\{\chi_1, \ldots, \chi_\ell\}$ not
mention any variable in $\vec{x}$, but they also contain all occurrences of the other variables and
constants. (Notice that this argument fails if the language contains any high-arity predicates,
including equality. For then $\xi'$ might include subformulas of the form $R(x, y)$ or $x = y$,
which can mix variables outside $\vec{x}$ with those in $\vec{x}$.)

Now, let $B_1, \ldots, B_{2^\ell}$ be all the "atoms" over $\chi_1, \ldots, \chi_\ell$. That is, we consider all formulas
$\chi'_1 \wedge \ldots \wedge \chi'_\ell$ where $\chi'_j$ is either $\chi_j$ or $\neg\chi_j$. Now consider the disjunction:

\[
\bigvee_{i=1}^{2^\ell} \left( B_i \wedge \|\xi'\|_{\vec{x}} \right).
\]

This is surely equivalent to $\|\xi'\|_{\vec{x}}$, because some $B_i$ must be true. However, if we assume
that a particular $B_i$ is true, we can simplify $\|\xi'\|_{\vec{x}}$ by replacing all the $\chi_j$ subformulas by
true or false, according to $B_i$. (Note that this is allowed only because the $\chi_j$ do not mention
any variable in $\vec{x}$.) The result is that we can simplify each disjunct $(B_i \wedge \|\xi'\|_{\vec{x}})$ considerably.
In fact, because of our previous observation about $\{\chi_1, \ldots, \chi_\ell\}$, there will be no constants
or variables outside $\vec{x}$ left within the proportion quantifier. This completes this step of
the induction. Since the other quantifiers can be treated similarly, this proves the flatness
result.
It now remains to show how a flat formula can be transformed to canonical form.
Suppose $\xi \in \mathcal{L}_1^{=}$ is flat. Let $\xi^* \in \mathcal{L}_1^{=}$ be the formula equivalent to $\xi$ obtained by using the
translation of Section 2.1. Every proportion comparison in $\xi^*$ is of the form $t \le t' \varepsilon_i$, where $t$
and $t'$ are polynomials over flat unconditional proportions. In fact, $t'$ is simply a product of
flat unconditional proportions (where the empty product is taken to be 1). Note also that
since we cleared away conditional proportions by multiplying by $t'$, if $t' = 0$ then so is $t$,
and so the formula $t \le t' \varepsilon_i$ is automatically true. We can therefore replace the comparison
by $(t' = 0) \vee (t \le t' \varepsilon_i \wedge t' > 0)$. Similarly, we can replace a negated comparison by an
expression of the form $\neg(t \le t' \varepsilon_i) \wedge t' > 0$.

The next step is to rewrite all the flat unconditional proportions in terms of atomic
proportions. In any such proportion $\|\xi'\|_{\vec{x}}$, the formula $\xi'$ is a Boolean combination of
$P(x_i)$ for predicates $P \in \mathcal{P}$ and $x_i \in \vec{x}$. Thus, the formula $\xi'$ is equivalent to a disjunction
$\bigvee_j (A^j_{i_1}(x_{i_1}) \wedge \ldots \wedge A^j_{i_m}(x_{i_m}))$, where each $A^j_{i_k}$ is an atom over $\mathcal{P}$ and $\vec{x} = \{x_{i_1}, \ldots, x_{i_m}\}$.
These disjuncts are mutually exclusive and the semantics treats distinct variables as being
independent, so

\[
\|\xi'\|_{\vec{x}} = \sum_j \prod_{k=1}^{m} \|A^j_{i_k}(x_{i_k})\|_{x_{i_k}} .
\]

We perform this replacement for each proportion expression. Furthermore, any term $t'$ in
an expression of the form $t \le t' \varepsilon_i$ will be a product of such expressions, and so will be
positive.

Next, we must put all pure first-order formulas in the right form. We first rewrite $\xi$ to
push all negations inwards as far as possible, so that only atomic subformulas and existential
formulas are negated. Next, note that since $\xi$ is flat, each existential subformula must have
the form $\exists x\, \xi'$, where $\xi'$ is a quantifier-free formula which mentions no constants and only
the variable $x$. Hence, $\xi'$ is a Boolean combination of $P(x)$ for predicates $P \in \mathcal{P}$. Again,
the formula $\xi'$ is equivalent to a disjunction of atoms of the form $\bigvee_{A \in \mathcal{A}(\xi')} A(x)$, so $\exists x\, \xi'$ is
equivalent to $\bigvee_{A \in \mathcal{A}(\xi')} \exists x\, A(x)$. We replace $\exists x\, \xi'$ by this expression. Finally, we must deal
with formulas of the form $P(c)$ or $\neg P(c)$ for $P \in \mathcal{P}$. This is easy: We can again replace a
formula $\xi$ of the form $P(c)$ or $\neg P(c)$ by the disjunction $\bigvee_{A \in \mathcal{A}(\xi)} A(c)$.
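The rewriting of a quantifier-free unary formula as a disjunction of atoms can be carried out by enumerating truth assignments. The following is a minimal sketch, assuming that a formula is given as a Python function on a truth assignment to the unary predicates and that an atom is represented as a dictionary from predicate names to signs. The names `all_atoms`, `atoms_of`, and `render` are hypothetical.

```python
from itertools import product

def all_atoms(predicates):
    """All 2^k atoms over k unary predicates; True means the predicate occurs positively."""
    return [dict(zip(predicates, signs))
            for signs in product([True, False], repeat=len(predicates))]

def atoms_of(formula, predicates):
    """The atoms whose disjunction is equivalent to the given quantifier-free formula."""
    return [atom for atom in all_atoms(predicates) if formula(atom)]

def render(atom, term="x"):
    """Write an atom applied to a variable or constant, e.g. 'P(c) & ~Q(c)'."""
    return " & ".join((p if positive else "~" + p) + f"({term})"
                      for p, positive in atom.items())

# Example: the formula P(c) or not Q(c) over the predicates P and Q.
selected = atoms_of(lambda a: a["P"] or not a["Q"], ["P", "Q"])
print(" | ".join("(" + render(atom, "c") + ")" for atom in selected))
```

Because the atoms are mutually exclusive and exhaustive, the selected set is determined by the formula up to logical equivalence, which is what makes the replacement step in the proof well defined.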

The penultimate step is to convert $\xi$ into disjunctive normal form. This essentially brings
things into canonical form. Note that since we dealt with formulas of the form $\neg P(c)$ in
the previous step, we do not have to deal with conjuncts of the form $\neg A_i(c)$.

The final step is to check that we do not have $A_i(c)$ and either $\neg\exists x\, A_i(x)$ or $A_j(c)$ for
some $j \neq i$ as conjuncts of some disjunct. If we do, we simply remove that disjunct. |



Appendix B. Proofs for Section 3.3 

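Lemma 3.11: There exist some function $h : \mathbb{N} \to \mathbb{N}$ and two strictly positive polynomial
functions $f, g : \mathbb{N} \to \mathbb{R}$ such that, for $KB \in \mathcal{L}_1^{=}$ and $u \in \Delta^K$, if $\#\mathit{worlds}^{\vec{\tau}}_N[u](KB) \neq 0$,
then

\[
(h(N)/f(N))\, e^{N H(u)} \le \#\mathit{worlds}^{\vec{\tau}}_N[u](KB) \le h(N)\, g(N)\, e^{N H(u)} .
\]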

Proof: To choose a world $W \in \mathcal{W}_N$ satisfying $KB$ such that $\pi(W) = u$, we must partition
the domain among the atoms according to the proportions in $u$, and then choose an
assignment for the constants in the language subject to the constraints imposed by $KB$. Finally,
even though $KB$ mentions only unary predicates, if there are any non-unary predicates in
the vocabulary we must choose a denotation for them.

Suppose $u = (u_1, \ldots, u_K)$, and let $N_i = u_i N$ for $i = 1, \ldots, K$. The number of partitions
of the domain into atoms is $\binom{N}{N_1 \ldots N_K}$; each such partition completely determines the
denotation for the unary predicates. We must also specify the denotations of the constant
symbols. There are at most $N^{|\mathcal{C}|}$ ways of choosing these, where $|\mathcal{C}|$ is the number of constant
symbols in the vocabulary. On the other hand, we know there
is at least one model $(W, \vec{\tau})$ of $KB$ such that $\pi(W) = u$, so there is at least one choice. In
fact, there is at least one world $W' \in \mathcal{W}_N$ such that $(W', \vec{\tau}) \models KB$ for each of the $\binom{N}{N_1 \ldots N_K}$
ways of partitioning the elements of the domain (and each such world $W'$ is isomorphic to
$W$). Finally we must choose the denotation of the non-unary predicates. However, $u$ does
not constrain this choice and, by assumption, neither does $KB$. Therefore the number of
such choices is some function $h(N)$ which is independent of $u$. 11 We conclude that:

\[
h(N) \binom{N}{N_1 \ldots N_K} \le \#\mathit{worlds}^{\vec{\tau}}_N[u](KB) \le h(N)\, N^{|\mathcal{C}|} \binom{N}{N_1 \ldots N_K} .
\]

It remains to estimate

\[
\binom{N}{N_1 \ldots N_K} = \frac{N!}{N_1!\, N_2! \cdots N_K!} .
\]

To obtain our result, we use Stirling's approximation for the factorials, which says that

\[
m! = \sqrt{2 \pi m}\; m^m e^{-m} (1 + O(1/m)) .
\]

It follows that there exist constants $L, U > 0$ such that

\[
L\, m^m e^{-m} \le m! \le U\, m\, m^m e^{-m}
\]

for all $m$. Using these bounds, as well as the fact that $N_i \le N$, we get:

\[
\frac{L\, N^N \prod_{i=1}^{K} e^{N_i}}{U^K N^K e^N \prod_{i=1}^{K} N_i^{N_i}}
\le \frac{N!}{N_1!\, N_2! \cdots N_K!}
\le \frac{U\, N\, N^N \prod_{i=1}^{K} e^{N_i}}{L^K e^N \prod_{i=1}^{K} N_i^{N_i}} .
\]
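Now, consider the expression common to both bounds. Since $\sum_{i=1}^{K} N_i = N$ and $N_i = u_i N$,

\[
\frac{N^N \prod_{i=1}^{K} e^{N_i}}{e^N \prod_{i=1}^{K} N_i^{N_i}}
= \prod_{i=1}^{K} \left( \frac{N}{N_i} \right)^{N_i}
= e^{\sum_{i=1}^{K} N_i \ln(N/N_i)}
= e^{-N \sum_{i=1}^{K} u_i \ln u_i}
= e^{N H(u)} .
\]
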

11. It is easy to verify that in fact

\[
h(N) = \prod_{R \in \Phi - \Psi} 2^{N^{\mathit{arity}(R)}}
\]

where $\Psi$ is the unary fragment of $\Phi$ and $\mathit{arity}(R)$ denotes the arity of the predicate symbol $R$.
We obtain that

\[
\frac{h(N)\, L}{U^K N^K}\, e^{N H(u)} \le \#\mathit{worlds}^{\vec{\tau}}_N[u](KB) \le N^{|\mathcal{C}|}\, h(N)\, \frac{U N}{L^K}\, e^{N H(u)} ,
\]

which is the desired result. |
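The core of the lemma is that the multinomial coefficient equals $e^{N H(u)}$ up to polynomial factors, which disappear when logarithms are divided by N. The following is a minimal numerical sketch of that statement, assuming natural logarithms and the entropy $H(u) = -\sum_i u_i \ln u_i$. The helper names are hypothetical, and the factors $h(N)$ and $N^{|\mathcal{C}|}$ are deliberately left out.

```python
from math import lgamma, log

def log_multinomial(counts):
    """ln( N! / (N_1! ... N_K!) ), computed through the log-gamma function."""
    total = sum(counts)
    return lgamma(total + 1) - sum(lgamma(n + 1) for n in counts)

def entropy(u):
    """H(u) = -sum u_i ln u_i, with the convention 0 ln 0 = 0."""
    return -sum(p * log(p) for p in u if p > 0)

def entropy_gap(counts):
    """H(u) - (1/N) ln(multinomial); nonnegative, and of order (ln N)/N."""
    total = sum(counts)
    u = [n / total for n in counts]
    return entropy(u) - log_multinomial(counts) / total

# Same proportions u = (0.3, 0.7) at two domain sizes.
print(entropy_gap((300, 700)), entropy_gap((3000, 7000)))
```

The gap is positive and shrinks as N grows with the proportions held fixed, which is the behavior the polynomial factors $f$ and $g$ in Lemma 3.11 are there to absorb.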

We next want to prove Theorem 3.13. To do this, it is useful to have an alternative
representation of the solution space $S^{\vec{\tau}}[KB]$. Towards this end, we have the following
definition.

Definition B.1: Let $\Pi^{\vec{\tau}}_N[KB] = \{\pi(W) : W \in \mathcal{W}_N, (W, \vec{\tau}) \models KB\}$. Let $\Pi^{\vec{\tau}}_\infty[KB]$ be the
limit of these spaces. Formally,

\[
\Pi^{\vec{\tau}}_\infty[KB] = \{u : \exists N_0 \text{ s.t. } \forall N \ge N_0\ \exists u^N \in \Pi^{\vec{\tau}}_N[KB] \text{ s.t. } \lim_{N \to \infty} u^N = u\} . \quad |
\]

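The following theorem establishes a tight connection between $S^{\vec{\tau}}[KB]$ and $\Pi^{\vec{\tau}}_\infty[KB]$.

Theorem B.2:

(a) For all $N$ and $\vec{\tau}$, we have $\Pi^{\vec{\tau}}_N[KB] \subseteq S^{\vec{\tau}}[KB]$.

(b) For all sufficiently small $\vec{\tau}$, we have $\Pi^{\vec{\tau}}_\infty[KB] = S^{\vec{\tau}}[KB]$.

Proof: Part (a) is immediate: If $u \in \Pi^{\vec{\tau}}_N[KB]$, then $u = \pi(W)$ for some $W \in \mathcal{W}_N$ such
that $(W, \vec{\tau}) \models KB$. It is almost immediate from the definitions that $\pi(W)$ must satisfy
$\Gamma(KB[\vec{\tau}])$, so $\pi(W) \in \mathit{Sol}[\Gamma(KB[\vec{\tau}])]$. The inclusion $\Pi^{\vec{\tau}}_N[KB] \subseteq S^{\vec{\tau}}[KB]$ now follows.

One direction of part (b) follows immediately from part (a). Recall that $\Pi^{\vec{\tau}}_N[KB] \subseteq
S^{\vec{\tau}}[KB]$ and that the points in $\Pi^{\vec{\tau}}_\infty[KB]$ are limits of a sequence of points in $\Pi^{\vec{\tau}}_N[KB]$. Since
$S^{\vec{\tau}}[KB]$ is closed, it follows that $\Pi^{\vec{\tau}}_\infty[KB] \subseteq S^{\vec{\tau}}[KB]$.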

For the opposite inclusion, the general strategy of the proof is to show the following:

(i) If $\vec{\tau}$ is sufficiently small, then for all $u \in S^{\vec{\tau}}[KB]$, there is some sequence of points
$\{u^{N_0}, u^{N_0+1}, u^{N_0+2}, u^{N_0+3}, \ldots\} \subseteq \mathit{Sol}[\Gamma(KB[\vec{\tau}])]$ such that, for all $N \ge N_0$, the
coordinates of $u^N$ are all integer multiples of $1/N$ and $\lim_{N \to \infty} u^N = u$.

(ii) If $w \in \mathit{Sol}[\Gamma(KB[\vec{\tau}])]$ and all its coordinates are integer multiples of $1/N$, then $w \in
\Pi^{\vec{\tau}}_N[KB]$.

This clearly suffices to prove that $u \in \Pi^{\vec{\tau}}_\infty[KB]$.

We begin with the proof of (ii), which is straightforward. Suppose the point
$w = (r_1/N, r_2/N, \ldots, r_K/N)$ is in $\mathit{Sol}[\Gamma(KB[\vec{\tau}])]$. We construct a world $W \in \mathcal{W}_N$ such
that $\pi(W) = w$ as follows. The denotation of atom $A_1$ is the set of elements $\{1, \ldots, r_1\}$,
the denotation of atom $A_2$ is the set $\{r_1 + 1, \ldots, r_1 + r_2\}$, and so on. It remains to choose
the denotations of the constants (since the denotation of the predicates of arity greater
than 1 is irrelevant). Without loss of generality we can assume $KB$ is in canonical form.
(If not, we consider $\widehat{KB}$.) Thus, $KB$ is a disjunction of conjunctions, say $\bigvee_j \xi_j$. Since
$w \in \mathit{Sol}[\Gamma(KB[\vec{\tau}])]$, we must have $w \in \mathit{Sol}[\Gamma(\xi_j[\vec{\tau}])]$ for some $j$. We use $\xi_j$ to define the
properties of the constants. If $\xi_j$ contains $A_i(c)$ for some atom $A_i$, then we make $c$ satisfy
$A_i$. Note that, by Definition 3.6, if $\xi_j$ has such a conjunct then $w_i > 0$. If $\xi_j$ contains no
atomic conjunct mentioning the constant $c$, then we make $c$ satisfy $A_i$ for some arbitrary
atom with $w_i > 0$. It should now be clear that $(W, \vec{\tau})$ satisfies $\xi_j$, and so satisfies $KB$. Note
that in this construction it is important that we started with $w$ in $\mathit{Sol}[\Gamma(KB[\vec{\tau}])]$, rather
than just in the closure space $S^{\vec{\tau}}[KB]$; otherwise, the point would not necessarily satisfy
$\Gamma(KB[\vec{\tau}])$.

We now consider condition (i). This is surprisingly difficult to prove; the proof involves
techniques from algebraic geometry. Our job would be relatively easy if $\mathit{Sol}[\Gamma(KB[\vec{\tau}])]$ were
an open set. Unfortunately, it is not. On the other hand, it would behave essentially like
an open set if we could replace the occurrences of $\le$ in $\Gamma(KB[\vec{\tau}])$ by $<$. It turns out that,
for our purposes here, this replacement is possible.

Let $\Gamma^{<}(KB[\vec{\tau}])$ be the same as $\Gamma(KB[\vec{\tau}])$ except that every (unnegated) conjunct of
the form $(t \le \tau_i t')$ is replaced by $(t < \tau_i t')$. (Notice that this is essentially the opposite
transformation to the one used when defining essential positivity in Definition 4.4.) Finally,
let $S^{<\vec{\tau}}[KB]$ be the closure of $\mathit{Sol}[\Gamma^{<}(KB[\vec{\tau}])]$. It turns out that, for all sufficiently small $\vec{\tau}$,
$S^{<\vec{\tau}}[KB] = S^{\vec{\tau}}[KB]$. This result, which we label as Lemma B.5, will be stated and proved later. For
now we use the lemma to continue the proof of the main result.

Consider some $u \in S^{\vec{\tau}}[KB]$. It suffices to show that for all $\delta > 0$ there exists $N_0$ such
that for all $N \ge N_0$, there exists a point $u^N \in \mathit{Sol}[\Gamma^{<}(KB[\vec{\tau}])]$ such that all the coordinates
of $u^N$ are integer multiples of $1/N$ and such that $|u - u^N| < \delta$. (For then we can take smaller
and smaller $\delta$'s to create a sequence $u^N$ converging to $u$.) Hence, let $\delta > 0$. By Lemma B.5,
we can find some $u' \in \mathit{Sol}[\Gamma^{<}(KB[\vec{\tau}])]$ such that $|u - u'| < \delta/2$. By definition, every conjunct
in $\Gamma^{<}(KB[\vec{\tau}])$ is of the form $q'(w) = 0$, $q'(w) > 0$, $q(w) < \tau_i q'(w)$, or $q(w) > \tau_i q'(w)$, where
$q'$ is a positive polynomial. Ignore for the moment the constraints of the form $q'(w) = 0$,
and consider the remaining constraints that $u'$ satisfies. These constraints all involve strict
inequalities, and the functions involved ($q$ and $q'$) are continuous. Thus, there exists some
$\delta' > 0$ such that for all $w$ for which $|u' - w| < \delta'$, these constraints are also satisfied by $w$.
Now consider a conjunct of the form $q'(w) = 0$ that is satisfied by $u'$. Since $q'$ is positive, this
happens if and only if the following condition holds: for every coordinate $w_i$ that actually
appears in $q'$, we have $u'_i = 0$. In particular, if $w$ and $u'$ have the same coordinates with
value 0, then $q'(w) = 0$. It follows that for all $w$, if $|u' - w| < \delta'$ and $u'$ and $w$ have the
same coordinates with value 0, then $w$ also satisfies $\Gamma^{<}(KB[\vec{\tau}])$.

We now construct $\vec{u}^N$ that satisfies the requirements. Let $i^*$ be the index of that
component of $\vec{u}'$ with the largest value. We define $\vec{u}^N$ by considering each of its components
$u^N_i$, for $1 \le i \le K$:

$$u^N_i = \begin{cases} 0 & \text{if } u'_i = 0, \\ \lceil N u'_i \rceil / N & \text{if } i \ne i^* \text{ and } u'_i > 0, \\ u'_{i^*} - \sum_{j \ne i^*} (u^N_j - u'_j) & \text{if } i = i^*. \end{cases}$$

It is easy to verify that the components of $\vec{u}^N$ sum to 1. All the components in $\vec{u}'$, other
than the $i^*$'th, are increased by at most $1/N$. The component $u^N_{i^*}$ is decreased by at most
$K/N$. We will show that $\vec{u}^N$ has the right properties for all $N > N_0$, where $N_0$ is such that
$1/N_0 < \min(u'_{i^*}, \delta/2, \delta')/2K$. The fact that $K/N < u'_{i^*}$ guarantees that $\vec{u}^N$ is in $\Delta^K$ for
all $N > N_0$. The fact that $2K/N_0 < \delta/2$ guarantees that $\vec{u}^N$ is within $\delta/2$ of $\vec{u}'$, and hence
within $\delta$ of $\vec{u}$. Since $2K/N_0 < \delta'$, it follows that $|\vec{u}' - \vec{u}^N| < \delta'$. Since $\vec{u}^N$ is constructed
to have exactly the same zero coordinates as $\vec{u}'$, we conclude that $\vec{u}^N \in \mathit{Sol}[\Gamma^{<}(KB[\vec{\tau}])]$, as
required. Condition (i), and hence the entire theorem, now follows. ∎
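
The rounding step of this construction can be illustrated with a small sketch. It assumes that the point $\vec{u}'$ is given as a list of exact rationals summing to 1; the function name `round_to_grid` and the sample point are hypothetical and are used only for illustration. Every component other than the largest is rounded up to the next multiple of $1/N$, zero components stay zero, and the largest component absorbs the accumulated rounding error.

```python
from fractions import Fraction
from math import ceil


def round_to_grid(u_prime, N):
    """Round a simplex point to integer multiples of 1/N (illustrative sketch).

    u_prime: list of Fractions that are non-negative and sum to 1.
    Zero coordinates are preserved; all coordinates except the largest
    are rounded up; the largest one is reduced by the total rounding error.
    """
    K = len(u_prime)
    i_star = max(range(K), key=lambda i: u_prime[i])  # index of largest component
    u_N = [Fraction(0)] * K
    for i, x in enumerate(u_prime):
        if i != i_star and x > 0:
            u_N[i] = Fraction(ceil(N * x), N)  # increased by less than 1/N
    # The largest component pays for all the increases, so the sum stays 1.
    excess = sum(u_N[j] - u_prime[j] for j in range(K) if j != i_star)
    u_N[i_star] = u_prime[i_star] - excess
    return u_N


# Hypothetical example point with K = 4 and one zero coordinate.
u_prime = [Fraction(1, 2), Fraction(1, 3), Fraction(0), Fraction(1, 6)]
u_N = round_to_grid(u_prime, 100)
```

With this sample point and $N = 100$, the rounded point keeps the zero coordinate, has all coordinates on the $1/N$ grid, and moves each coordinate by at most $K/N$, which is the property the proof relies on.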

It now remains to prove Lemma B.5, which was used in the proof just given. As we
hinted earlier, this requires tools from algebraic geometry. We base our definitions on the
presentation in (Bochnak, Coste, & Roy, 1987). A subset $A$ of $\mathbb{R}^\ell$ is said to be semi-algebraic
if it is definable in the language of real-closed fields. That is, $A$ is semi-algebraic if there is
a first-order formula $\varphi(x_1, \ldots, x_\ell)$ whose free variables are $x_1, \ldots, x_\ell$ and whose only
non-logical symbols are $0$, $1$, $+$, $\times$, $<$ and $=$, such that $\mathbb{R} \models \varphi(u_1, \ldots, u_\ell)$ iff $(u_1, \ldots, u_\ell) \in A$.$^{12}$
A function $f : X \to Y$, where $X \subseteq \mathbb{R}^h$ and $Y \subseteq \mathbb{R}^\ell$, is said to be semi-algebraic if its graph
(i.e., $\{(\vec{u}, \vec{w}) : f(\vec{u}) = \vec{w}\}$) is semi-algebraic. The main tool we use is the following Curve
Selection Lemma (see (Bochnak et al., 1987, p. 34)):

Lemma B.3: Suppose that $A$ is a semi-algebraic set in $\mathbb{R}^\ell$ and $\vec{u} \in \overline{A}$. Then there exists
a continuous, semi-algebraic function $f : [0, 1] \to \mathbb{R}^\ell$ such that $f(0) = \vec{u}$ and $f(t) \in A$ for
all $t \in (0, 1]$.

Our first use of the Curve Selection Lemma is in the following, which says that, in a
certain sense, semi-algebraic functions behave "nicely" near limits. The type of phenomenon
we wish to avoid is illustrated by $x \sin(1/x)$, which is continuous at 0, but has infinitely many
local maxima and minima near 0.
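
The oscillation of $x \sin(1/x)$ can be observed numerically. The following is only a sampling-based sketch, not part of the proof: the helper `count_local_extrema` and the chosen intervals and sample counts are assumptions made for illustration. It counts sign changes of successive differences on a fine grid, and the count grows as the interval moves toward 0.

```python
import math


def g(x):
    """The function x * sin(1/x), extended continuously by g(0) = 0."""
    return x * math.sin(1.0 / x) if x != 0 else 0.0


def count_local_extrema(f, a, b, samples):
    """Count grid points in (a, b) where successive differences change sign."""
    xs = [a + (b - a) * k / samples for k in range(samples + 1)]
    ys = [f(x) for x in xs]
    count = 0
    for k in range(1, samples):
        if (ys[k] - ys[k - 1]) * (ys[k + 1] - ys[k]) < 0:
            count += 1
    return count


# Two intervals of shrinking scale; the second is ten times closer to 0.
far_from_zero = count_local_extrema(g, 0.01, 0.1, 200_000)
near_zero = count_local_extrema(g, 0.001, 0.01, 200_000)
```

A semi-algebraic function cannot behave this way, which is what Proposition B.4 below makes precise.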



Proposition B.4: Suppose that $g : [0, 1] \to \mathbb{R}$ is a continuous, semi-algebraic function
such that $g(u) > 0$ if $u > 0$ and $g(0) = 0$. Then there exists some $\epsilon > 0$ such that $g$ is
strictly increasing in the interval $[0, \epsilon]$.



Proof: Suppose, by way of contradiction, that $g$ satisfies the hypotheses of the proposition
but there is no $\epsilon$ such that $g$ is increasing in the interval $[0, \epsilon]$. We define a point $u$ in $[0, 1]$
to be bad if for some $u' \in [0, u)$ we have $g(u') \ge g(u)$. Let $A$ be the set of all the bad points.
Since $g$ is semi-algebraic so is $A$, since $u \in A$ iff

$$\exists u' \, ((0 \le u' < u) \land (g(u) \le g(u'))).$$

Since, by assumption, $g$ is not increasing in any interval $[0, \epsilon]$, we can find bad points
arbitrarily close to 0 and so $0 \in \overline{A}$. By the Curve Selection Lemma, there is a continuous
semi-algebraic curve $f : [0, 1] \to \mathbb{R}$ such that $f(0) = 0$ and $f(t) \in A$ for all $t \in (0, 1]$.
Because of the continuity of $f$, the range of $f$, i.e., $f([0, 1])$, is $[0, r]$ for some $r \in [0, 1]$. By
the definition of $f$, $(0, r] \subseteq A$. Since $0 \notin A$, it follows that $f(1) \ne 0$; therefore $r > 0$ and so,
by assumption, $g(r) > 0$. Since $g$ is a continuous function, it achieves a maximum $v > 0$
over the range $[0, r]$. Consider the minimum point in the interval where this maximum is
achieved. More precisely, let $u$ be the infimum of the set $\{u' \in [0, r] : g(u') = v\}$; clearly,
$g(u) = v$. Since $v > 0$ we obtain that $u > 0$ and therefore $u \in A$. Thus, $u$ is bad. But that
means that there is a point $u' < u$ for which $g(u') \ge g(u)$, which contradicts the choice of
$v$ and $u$. ∎

We can now prove Lemma B.5. Recall, the result we need is as follows. 

12. In (Bochnak et al., 1987), a set is taken to be semi-algebraic if it is definable by a quantifier-free formula 
in the language of real closed fields. However, as observed in (Bochnak et al., 1987), since the theory of 
real closed fields admits elimination of quantifiers (Tarski, 1951), the two definitions are equivalent. 
Lemma B.5: For all sufficiently small $\vec{\tau}$, $S^{<\vec{\tau}}[KB] = S^{\vec{\tau}}[KB]$.

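Proof: Clearly $S^{<\vec{\tau}}[KB] \subseteq S^{\vec{\tau}}[KB]$. To prove the reverse inclusion we consider $\widehat{KB}$,
a canonical form equivalent of KB. We consider each disjunct of $\widehat{KB}$ separately. Let
$\xi$ be a conjunction that is one of the disjuncts in $\widehat{KB}$. It clearly suffices to show that
$\mathit{Sol}[\Gamma(\xi[\vec{\tau}])] \subseteq S^{<\vec{\tau}}[\xi] = \overline{\mathit{Sol}[\Gamma^{<}(\xi[\vec{\tau}])]}$. Assume, by way of contradiction, that for arbitrarily
small $\vec{\tau}$, there exists some $\vec{u} \in \mathit{Sol}[\Gamma(\xi[\vec{\tau}])]$ which is "separated" from the set $\mathit{Sol}[\Gamma^{<}(\xi[\vec{\tau}])]$,
i.e., is not in its closure. More formally, we say that $\vec{u}$ is $\delta$-separated from $\mathit{Sol}[\Gamma^{<}(\xi[\vec{\tau}])]$ if
there is no $\vec{u}' \in \mathit{Sol}[\Gamma^{<}(\xi[\vec{\tau}])]$ such that $|\vec{u} - \vec{u}'| < \delta$.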

We now consider those $\vec{\tau}$ and those points in $\mathit{Sol}[\Gamma(\xi[\vec{\tau}])]$ that are separated from
$\mathit{Sol}[\Gamma^{<}(\xi[\vec{\tau}])]$:$^{13}$

$$A = \{(\vec{\tau}, \vec{u}, \delta) : \vec{\tau} > \vec{0},\ \delta > 0,\ \vec{u} \in \mathit{Sol}[\Gamma(\xi[\vec{\tau}])] \text{ is } \delta\text{-separated from } \mathit{Sol}[\Gamma^{<}(\xi[\vec{\tau}])]\}.$$

Clearly $A$ is semi-algebraic. By assumption, there are points in $A$ for arbitrarily small
tolerance vectors $\vec{\tau}$. Since $A$ is a bounded subset of $\mathbb{R}^{m+K+1}$ (where $m$ is the number of
tolerance values in $\vec{\tau}$), we can use the Bolzano-Weierstrass Theorem to conclude that this
set of points has an accumulation point whose first component is $\vec{0}$. Thus, there is a point
$(\vec{0}, \vec{w}, \delta')$ in $\overline{A}$. By the Curve Selection Lemma, there is a continuous semi-algebraic function
$f : [0, 1] \to \mathbb{R}^{m+K+1}$ such that $f(0) = (\vec{0}, \vec{w}, \delta')$ and $f(t) \in A$ for $t \in (0, 1]$.

Since $f$ is semi-algebraic, it is semi-algebraic in each of its coordinates. By Proposition B.4,
there is some $v > 0$ such that $f$ is strictly increasing in each of its first $m$ coordinates over
the domain $[0, v]$. Suppose that $f(v) = (\vec{\tau}, \vec{u}, \delta)$. Now, consider the constraints in $\Gamma(\xi[\vec{\tau}])$
that have the form $q(\vec{w}) > \tau_j q'(\vec{w})$. These constraints are all satisfied by $\vec{u}$ and they all
involve strong inequalities. By the continuity of the polynomials $q$ and $q'$, there exists some
$\epsilon > 0$ such that, for all $\vec{u}'$ such that $|\vec{u} - \vec{u}'| < \epsilon$, $\vec{u}'$ also satisfies these constraints.

Now, by the continuity of $f$, there exists a point $v' \in (0, v)$ sufficiently close to $v$
such that if $f(v') = (\vec{\tau}', \vec{u}', \delta')$, then $|\vec{u} - \vec{u}'| < \min(\delta, \epsilon)$. Since $f(v) = (\vec{\tau}, \vec{u}, \delta) \in A$ and
$|\vec{u} - \vec{u}'| < \delta$, it follows that $\vec{u}' \notin \mathit{Sol}[\Gamma^{<}(\xi[\vec{\tau}])]$. We conclude the proof by showing that this is
impossible. That is, we show that $\vec{u}' \in \mathit{Sol}[\Gamma^{<}(\xi[\vec{\tau}])]$. The constraints appearing in $\Gamma^{<}(\xi[\vec{\tau}])$
can be of the following forms: $q'(\vec{w}) = 0$, $q'(\vec{w}) > 0$, $q(\vec{w}) < \tau_j q'(\vec{w})$, or $q(\vec{w}) > \tau_j q'(\vec{w})$,
where $q'$ is a positive polynomial. Since $f(v') \in A$, we know that $\vec{u}' \in \mathit{Sol}[\Gamma(\xi[\vec{\tau}'])]$. The
constraints of the form $q'(\vec{w}) = 0$ and $q'(\vec{w}) > 0$ are identical in $\Gamma(\xi[\vec{\tau}'])$ and in $\Gamma^{<}(\xi[\vec{\tau}])$,
and are therefore satisfied by $\vec{u}'$. Since $|\vec{u}' - \vec{u}| < \epsilon$, our discussion in the previous paragraph
implies that the constraints of the form $q(\vec{w}) > \tau_j q'(\vec{w})$ are also satisfied by $\vec{u}'$. Finally,
consider a constraint of the form $q(\vec{w}) < \tau_j q'(\vec{w})$. The corresponding constraint in $\Gamma(\xi[\vec{\tau}'])$
is $q(\vec{w}) \le \tau'_j q'(\vec{w})$. Since $\vec{u}'$ satisfies this latter constraint, we know that $q(\vec{u}') \le \tau'_j q'(\vec{u}')$.
But now, recall that we proved that $f$ is increasing over $[0, v]$ in the first $m$ coordinates.
In particular, $\tau'_j < \tau_j$. By the definition of canonical form, $q'(\vec{u}') > 0$, so that we conclude
$q(\vec{u}') \le \tau'_j q'(\vec{u}') < \tau_j q'(\vec{u}')$. Hence the constraints of this type are also satisfied by $\vec{u}'$. This
concludes the proof that $\vec{u}' \in \mathit{Sol}[\Gamma^{<}(\xi[\vec{\tau}])]$, thus deriving a contradiction and proving
the result. ∎

We are finally ready to prove Theorem 3.13. 
13. We consider only those components in the infinite vector $\vec{\tau}$ that actually appear in $\mathit{Sol}[\Gamma(\xi[\vec{\tau}])]$.
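Theorem 3.13: For all sufficiently small $\vec{\tau}$, the following is true. Let $Q$ be the points
with greatest entropy in $S^{\vec{\tau}}[KB]$ and let $O \subseteq \mathbb{R}^K$ be any open set containing $Q$. Then for
all $\theta \in \mathcal{L}^{\approx}$ and for $\lim^* \in \{\limsup, \liminf\}$ we have

$$\lim^*_{N \to \infty} \Pr\nolimits_N^{\vec{\tau}}(\theta \mid KB) = \lim^*_{N \to \infty} \frac{\#\mathit{worlds}_N^{\vec{\tau}}[O](\theta \land KB)}{\#\mathit{worlds}_N^{\vec{\tau}}[O](KB)}.$$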

Proof: Let $\vec{\tau}$ be small enough so that Theorem B.2 applies and let $Q$ and $O$ be as in the
statement of the theorem. It clearly suffices to show that the set $O$ contains almost all of
the worlds that satisfy KB. More precisely, the fraction of such worlds that are in $O$ tends
to 1 as $N \to \infty$.

Let $\rho$ be the entropy of the points in $Q$. We begin the proof by showing the existence
of $\rho_L < \rho_U$ $(< \rho)$ such that (for sufficiently large $N$) (a) every point $\vec{u} \in \Pi_N^{\vec{\tau}}[KB]$ where
$\vec{u} \notin O$ has entropy at most $\rho_L$ and (b) there is at least one point $\vec{u} \in \Pi_N^{\vec{\tau}}[KB]$ with $\vec{u} \in O$
and entropy at least $\rho_U$.

For part (a), consider the space $S^{\vec{\tau}}[KB] - O$. Since this space is closed, the entropy
function takes on a maximum value in this space; let this be $\rho_L$. Since this space does
not include any point with entropy $\rho$ (these are all in $Q \subseteq O$), we must have $\rho_L < \rho$.
By Theorem B.2, $\Pi_N^{\vec{\tau}}[KB] \subseteq S^{\vec{\tau}}[KB]$. Therefore, for any $N$, the entropy of any point in
$\Pi_N^{\vec{\tau}}[KB] - O$ is at most $\rho_L$.

For part (b), let $\rho_U$ be some value in the interval $(\rho_L, \rho)$ (for example $(\rho_L + \rho)/2$) and
let $\vec{v}$ be any point in $Q$. By the continuity of the entropy function, there exists some $\delta > 0$
such that for all $\vec{u}$ with $|\vec{u} - \vec{v}| < \delta$, we have $H(\vec{u}) > \rho_U$. Because $O$ is open we can, by
considering a smaller $\delta$ if necessary, assume that $|\vec{u} - \vec{v}| < \delta$ implies $\vec{u} \in O$. By the second
part of Theorem B.2, there is a sequence of points $\vec{u}^N \in \Pi_N^{\vec{\tau}}[KB]$ such that $\lim_{N \to \infty} \vec{u}^N = \vec{v}$.
In particular, for $N$ large enough we have $|\vec{u}^N - \vec{v}| < \delta$, so that $H(\vec{u}^N) > \rho_U$, proving part
(b).

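To complete the proof, we use Lemma 3.11 to conclude that for all $N$,

$$\#\mathit{worlds}_N^{\vec{\tau}}(KB) \ge \#\mathit{worlds}_N^{\vec{\tau}}[\vec{u}^N](KB) \ge (h(N)/f(N))\, e^{N H(\vec{u}^N)} \ge (h(N)/f(N))\, e^{N \rho_U}.$$

On the other hand,

$$\#\mathit{worlds}_N^{\vec{\tau}}[\Delta^K - O](KB) \le \sum_{\vec{u} \in \Pi_N^{\vec{\tau}}[KB] - O} \#\mathit{worlds}_N^{\vec{\tau}}[\vec{u}](KB)$$
$$\le |\{\vec{w} \in \Pi_N^{\vec{\tau}}[KB] : \vec{w} \notin O\}| \, h(N) g(N) e^{N \rho_L}$$
$$\le (N + 1)^K h(N) g(N) e^{N \rho_L}.$$

Therefore the fraction of models of KB which are outside $O$ is at most

$$\frac{(N + 1)^K h(N) f(N) g(N) e^{N \rho_L}}{h(N) e^{N \rho_U}} = \frac{(N + 1)^K f(N) g(N)}{e^{N(\rho_U - \rho_L)}}.$$

Since $(N + 1)^K f(N) g(N)$ is a polynomial in $N$, this fraction tends to 0 as $N$ grows large.
The result follows. ∎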
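
The concentration phenomenon behind this argument can be checked numerically in the simplest possible setting. The sketch below is an illustration under stated assumptions, not the general computation: it assumes a vocabulary with a single unary predicate (so $K = 2$ atoms), no constants, and an empty knowledge base, so that the number of worlds of size $N$ in which exactly $k$ elements satisfy the predicate is the binomial coefficient $\binom{N}{k}$ and the maximum-entropy point is $(1/2, 1/2)$. The function names are hypothetical.

```python
import math
from fractions import Fraction


def entropy(u):
    """Entropy H(u) = -sum_i u_i ln u_i, with 0 ln 0 taken to be 0."""
    return -sum(x * math.log(x) for x in u if x > 0)


def log_count_per_element(N, k):
    """(1/N) ln C(N, k): growth rate of the number of worlds with proportion k/N."""
    return math.log(math.comb(N, k)) / N


def fraction_outside(N, eps):
    """Fraction of the 2^N worlds whose proportion k/N is at least eps from 1/2."""
    half = Fraction(1, 2)
    outside = sum(math.comb(N, k) for k in range(N + 1)
                  if abs(Fraction(k, N) - half) >= eps)
    return outside / 2 ** N


eps = Fraction(1, 10)
small = fraction_outside(100, eps)    # worlds outside the neighborhood, N = 100
large = fraction_outside(1000, eps)   # the same fraction for N = 1000
```

In this toy case the per-element log-count approaches the entropy of the proportion vector, and the fraction of worlds outside a fixed neighborhood of the maximum-entropy point shrinks rapidly as $N$ grows, mirroring the ratio bounded in the proof.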
Appendix C. Proofs for Section 4

Proposition 4.6: Assume that KB is essentially positive and let $Q$ be the set of maximum-entropy
points of $S^{\vec{0}}[KB]$ (and thus also of $S^{\le \vec{0}}[KB]$). Then for all $\epsilon > 0$ and all sufficiently
small tolerance vectors $\vec{\tau}$ (where "sufficiently small" may depend on $\epsilon$), every maximum-entropy
point of $S^{\vec{\tau}}[KB]$ is within $\epsilon$ of some maximum-entropy point in $Q$.

Proof: Fix $\epsilon > 0$. By way of contradiction, assume that there is some sequence
of tolerance vectors $\vec{\tau}^m$, $m = 1, 2, \ldots$, that converges to $\vec{0}$, and for each $m$ a maximum-entropy
point $\vec{u}^m$ of $S^{\vec{\tau}^m}[KB]$ such that for all $m$, $\vec{u}^m$ is at least $\epsilon$ away from $Q$. Since
the space $\Delta^K$ is compact, we can assume without loss of generality that this sequence
converges to some point $\vec{u}$. Recall that $\Gamma(KB)$ is a finite combination (using "and" and
"or") of constraints, where every such constraint is of the form $q'(\vec{w}) = 0$, $q'(\vec{w}) > 0$,
$q(\vec{w}) \le \varepsilon_j q'(\vec{w})$, or $q(\vec{w}) > \varepsilon_j q'(\vec{w})$, such that $q'$ is a positive polynomial. Since the overall
number of constraints is finite we can assume, again without loss of generality, that all the
$\vec{u}^m$'s satisfy precisely the same constraints. We claim that the corresponding conjuncts in
$\Gamma^{\le}(KB[\vec{0}])$ are satisfied by $\vec{u}$. For a conjunct of the form $q'(\vec{w}) = 0$ note that, if $q'(\vec{u}^m) = 0$
for all $m$, then this also holds at the limit, so that $q'(\vec{u}) = 0$. A conjunct of the form $q'(\vec{w}) > 0$
translates into $q'(\vec{w}) \ge 0$ in $\Gamma^{\le}(KB[\vec{0}])$; such conjuncts are trivially satisfied by any point
in $\Delta^K$. If a conjunct of the form $q(\vec{w}) \le \varepsilon_j q'(\vec{w})$ is satisfied for all $\vec{u}^m$ and $\vec{\tau}^m$, then at
the limit we have $q(\vec{u}) \le 0$, which is precisely the corresponding conjunct in $\Gamma^{\le}(KB[\vec{0}])$.
Finally, for a conjunct of the form $q(\vec{w}) > \varepsilon_j q'(\vec{w})$, if $q(\vec{u}^m) > \tau^m_j q'(\vec{u}^m)$ for all $m$, then at
the limit we have $q(\vec{u}) \ge 0$, which again is the corresponding conjunct in $\Gamma^{\le}(KB[\vec{0}])$. It
follows that $\vec{u}$ is in $S^{\le \vec{0}}[KB]$.

By assumption, all points $\vec{u}^m$ are at least $\epsilon$ away from $Q$. Hence, $\vec{u}$ cannot be in $Q$.
If we let $\rho$ represent the entropy of the points in $Q$, since $Q$ is the set of all maximum-entropy
points in $S^{\le \vec{0}}[KB]$, it follows that $H(\vec{u}) < \rho$. Choose $\rho_L$ and $\rho_U$ such that $H(\vec{u}) <
\rho_L < \rho_U < \rho$. Since the entropy function is continuous, we know that for sufficiently
large $m$, $H(\vec{u}^m) \le \rho_L$. Since $\vec{u}^m$ is a maximum-entropy point of $S^{\vec{\tau}^m}[KB]$, it follows
that the entropy achieved in this space for sufficiently large $m$ is at most $\rho_L$. We derive a
contradiction by showing that for sufficiently large $m$, there is some point in $\mathit{Sol}[\Gamma(KB[\vec{\tau}^m])]$
with entropy at least $\rho_U$. The argument is as follows. Let $\vec{v}$ be some point in $Q$. Since $\vec{v}$
is a maximum-entropy point of $S^{\vec{0}}[KB]$, there are points in $\mathit{Sol}[\Gamma(KB[\vec{0}])]$ arbitrarily close
to $\vec{v}$. In particular, there is some point $\vec{u}' \in \mathit{Sol}[\Gamma(KB[\vec{0}])]$ whose entropy is at least $\rho_U$.
As we now show, this point is also in $\mathit{Sol}[\Gamma(KB[\vec{\tau}])]$ for all sufficiently small $\vec{\tau}$. Again,
consider all the conjuncts in $\Gamma(KB[\vec{0}])$ satisfied by $\vec{u}'$ and the corresponding conjuncts in
$\Gamma(KB[\vec{\tau}])$. Conjuncts of the form $q'(\vec{w}) = 0$ and $q'(\vec{w}) > 0$ in $\Gamma(KB[\vec{0}])$ remain unchanged
in $\Gamma(KB[\vec{\tau}])$. Conjuncts of the form $q(\vec{w}) \le \tau_j q'(\vec{w})$ in $\Gamma(KB[\vec{\tau}])$ are certainly satisfied
by $\vec{u}'$, since the corresponding conjunct in $\Gamma(KB[\vec{0}])$, namely $q(\vec{w}) \le 0$, is satisfied by $\vec{u}'$,
so that $q(\vec{u}') \le 0 \le \tau_j q'(\vec{u}')$ (recall that $q'$ is a positive polynomial). Finally, consider a
conjunct in $\Gamma(KB[\vec{\tau}])$ of the form $q(\vec{w}) > \tau_j q'(\vec{w})$. The corresponding conjunct in $\Gamma(KB[\vec{0}])$
is $q(\vec{w}) > 0$. Suppose $q(\vec{u}') = \delta > 0$. Since the value of $q'$ is bounded over the compact
space $\Delta^K$, it follows that for all sufficiently small $\tau_j$, $\tau_j q'(\vec{u}') < \delta$. Thus, $q(\vec{u}') > \tau_j q'(\vec{u}')$ for
all sufficiently small $\tau_j$, as required. It follows that $\vec{u}'$ is in $\mathit{Sol}[\Gamma(KB[\vec{\tau}])]$ for all sufficiently
small $\vec{\tau}$ and, in particular, in $\mathit{Sol}[\Gamma(KB[\vec{\tau}^m])]$ for all sufficiently large $m$. But $H(\vec{u}') \ge \rho_U$,
whereas we showed that the maximum entropy achieved in $S^{\vec{\tau}^m}[KB]$ is at most $\rho_L < \rho_U$.
This contradiction proves that our assumption was false, so that the conclusion of the
proposition necessarily holds. ∎

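Theorem 4.9: Suppose $\varphi(c)$ is a simple query for KB. For all $\vec{\tau}$ sufficiently small, if $Q$
is the set of maximum-entropy points in $S^{\vec{\tau}}[KB]$ and $F_{[\psi]}(\vec{v}) > 0$ for all $\vec{v} \in Q$, then for
$\lim^* \in \{\limsup, \liminf\}$ we have

$$\lim^*_{N \to \infty} \Pr\nolimits_N^{\vec{\tau}}(\varphi(c) \mid KB) \in \Big[ \inf_{\vec{v} \in Q} F_{[\varphi|\psi]}(\vec{v}),\ \sup_{\vec{v} \in Q} F_{[\varphi|\psi]}(\vec{v}) \Big].$$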



Proof: Let $W \in \mathcal{W}^*$, and let $\vec{u} = \pi(W)$. The value of the proportion expression $\|\psi(x)\|_x$
at $W$ is clearly

$$\sum_{A_j \in \mathcal{A}(\psi)} \|A_j(x)\|_x = \sum_{A_j \in \mathcal{A}(\psi)} u_j = F_{[\psi]}(\vec{u}).$$

If $F_{[\psi]}(\vec{u}) > 0$, then by the same reasoning we conclude that the value of $\|\varphi(x) \mid \psi(x)\|_x$ at
$W$ is equal to $F_{[\varphi|\psi]}(\vec{u})$.
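
The two quantities just computed are simple sums over atoms, which the following sketch makes concrete. It is an illustration only: atoms are represented by list indices, the sets of atoms entailing each formula are passed in explicitly, and the function names and the example point are hypothetical choices rather than notation from the text.

```python
from fractions import Fraction


def proportion(u, atoms):
    """Sum of the coordinates u_j over the atoms A_j that entail the formula."""
    return sum(u[j] for j in atoms)


def conditional_proportion(u, atoms_phi, atoms_psi):
    """Ratio of the mass on atoms entailing both formulas to the mass on psi.

    Returns None when the denominator is 0, i.e., when the value is undefined.
    """
    denom = proportion(u, atoms_psi)
    if denom == 0:
        return None
    return proportion(u, set(atoms_phi) & set(atoms_psi)) / denom


# Hypothetical example with two unary predicates P and Q, hence K = 4 atoms:
# index 0: P and Q, 1: P and not Q, 2: not P and Q, 3: not P and not Q.
u = [Fraction(2, 5), Fraction(1, 10), Fraction(1, 10), Fraction(2, 5)]
atoms_P = {0, 1}
atoms_Q = {0, 2}
value = conditional_proportion(u, atoms_P, atoms_Q)   # proportion of P among Q
```

Here the mass on $Q$ is $1/2$ and the mass on $P \land Q$ is $2/5$, so the conditional proportion at this point is $4/5$; when the mass on the conditioning formula is 0 the value is undefined, which is why the theorem requires $F_{[\psi]}(\vec{v}) > 0$.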

Now, let $\lambda_L$ and $\lambda_U$ be $\inf_{\vec{v} \in Q} F_{[\varphi|\psi]}(\vec{v})$ and $\sup_{\vec{v} \in Q} F_{[\varphi|\psi]}(\vec{v})$ respectively; by our
assumption, $F_{[\varphi|\psi]}(\vec{v})$ is well-defined for all $\vec{v} \in Q$. Since the denominator is not 0, $F_{[\varphi|\psi]}$ is
a continuous function at each maximum-entropy point. Thus, since $F_{[\varphi|\psi]}(\vec{v}) \in [\lambda_L, \lambda_U]$ for
all maximum-entropy points, the value of $F_{[\varphi|\psi]}(\vec{u})$ for $\vec{u}$ "close" to some $\vec{v} \in Q$ will either
be in the range $[\lambda_L, \lambda_U]$ or very close to it. More precisely, choose any $\epsilon > 0$, and define
$\theta[\epsilon]$ to be the formula

$$\|\varphi(x) \mid \psi(x)\|_x \in [\lambda_L - \epsilon, \lambda_U + \epsilon].$$

Since $\epsilon > 0$, it is clear that there is some sufficiently small open set $O$ around $Q$ such
that this proportion expression is well-defined and within these bounds at all worlds in $O$.
Thus, by Corollary 3.14, $\Pr_\infty^{\vec{\tau}}(\theta[\epsilon] \mid KB) = 1$. Using Theorem 3.16, we obtain that for $\lim^*$
as above,

$$\lim^*_{N \to \infty} \Pr\nolimits_N^{\vec{\tau}}(\varphi(c) \mid KB) = \lim^*_{N \to \infty} \Pr\nolimits_N^{\vec{\tau}}(\varphi(c) \mid KB \land \theta[\epsilon]).$$

But now we can use the direct inference technique outlined earlier. We are interested in
the probability of $\varphi(c)$, where the only information we have about $c$ in the knowledge base
is $\psi(c)$ and where we have statistics for $\|\varphi(x) \mid \psi(x)\|_x$. These are precisely the conditions
under which Theorem 4.1 applies. We conclude that

$$\lim^*_{N \to \infty} \Pr\nolimits_N^{\vec{\tau}}(\varphi(c) \mid KB) \in [\lambda_L - \epsilon, \lambda_U + \epsilon].$$

Since this holds for all $\epsilon > 0$, it is necessarily the case that

$$\lim^*_{N \to \infty} \Pr\nolimits_N^{\vec{\tau}}(\varphi(c) \mid KB) \in [\lambda_L, \lambda_U],$$

as required. ∎

Theorem 4.11: Suppose $\varphi(c)$ is a simple query for KB. If the space $S^{\vec{0}}[KB]$ has a unique
maximum-entropy point $\vec{v}$, KB is essentially positive, and $F_{[\psi]}(\vec{v}) > 0$, then

$$\Pr\nolimits_\infty(\varphi(c) \mid KB) = F_{[\varphi|\psi]}(\vec{v}).$$
Proof: Note that the fact that $S^{\vec{0}}[KB]$ has a unique maximum-entropy point does not
guarantee that this is also the case for $S^{\vec{\tau}}[KB]$. However, Proposition 4.6 implies that
the maximum-entropy points of the latter space are necessarily close to $\vec{v}$. More precisely,
if we choose some $\epsilon > 0$, we conclude that for all sufficiently small $\vec{\tau}$, all the maximum-entropy
points of $S^{\vec{\tau}}[KB]$ will be within $\epsilon$ of $\vec{v}$. Now, pick some arbitrary $\delta > 0$. Since
$F_{[\psi]}(\vec{v}) > 0$, it follows that $F_{[\varphi|\psi]}$ is continuous at $\vec{v}$. Therefore, there exists some $\epsilon > 0$
such that if $\vec{u}$ is within $\epsilon$ of $\vec{v}$, $F_{[\varphi|\psi]}(\vec{u})$ is within $\delta$ of $F_{[\varphi|\psi]}(\vec{v})$. In particular, this is the
case for all maximum-entropy points of $S^{\vec{\tau}}[KB]$ for all sufficiently small $\vec{\tau}$. This allows
us to apply Theorem 4.9 and conclude that for all sufficiently small $\vec{\tau}$ and for $\lim^* \in
\{\limsup, \liminf\}$, $\lim^*_{N \to \infty} \Pr_N^{\vec{\tau}}(\varphi(c) \mid KB)$ is within $\delta$ of $F_{[\varphi|\psi]}(\vec{v})$. Hence, this is also the
case for $\lim_{\vec{\tau} \to \vec{0}} \lim^*_{N \to \infty} \Pr_N^{\vec{\tau}}(\varphi(c) \mid KB)$. Since this holds for all $\delta > 0$, it follows that

$$\lim_{\vec{\tau} \to \vec{0}} \liminf_{N \to \infty} \Pr\nolimits_N^{\vec{\tau}}(\varphi(c) \mid KB) = \lim_{\vec{\tau} \to \vec{0}} \limsup_{N \to \infty} \Pr\nolimits_N^{\vec{\tau}}(\varphi(c) \mid KB) = F_{[\varphi|\psi]}(\vec{v}).$$

Thus, by definition, $\Pr_\infty(\varphi(c) \mid KB) = F_{[\varphi|\psi]}(\vec{v})$. ∎

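Theorem 4.14: Let $\Lambda$ be a conjunction of constraints of the form $\Pr(\beta \mid \beta') = \lambda$ or
$\Pr(\beta \mid \beta') \in [\lambda_1, \lambda_2]$. There is a unique probability distribution $\mu^*$ of maximum entropy
satisfying $\Lambda$. Moreover, for all $\beta$ and $\beta'$, if $\Pr_{\mu^*}(\beta') > 0$, then

$$\Pr\nolimits_\infty(\xi_\beta(c) \mid \xi_{\beta'}(c) \land KB'[\Lambda]) = \Pr\nolimits_{\mu^*}(\beta \mid \beta').$$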

Proof: Clearly, the formulas $\varphi(x) = \xi_\beta(x)$ and $\psi(x) = \xi_{\beta'}(x)$ are essentially propositional.
The knowledge base $KB'[\Lambda]$ is in the form of a conjunction of simple proportion formulas,
none of which are negated. As a result, the set of constraints associated with $KB =
\psi(c) \land KB'[\Lambda]$ also has a simple form. $KB'[\Lambda]$ generates a conjunction of constraints which
can be taken as having the form $q(\vec{w}) \le \varepsilon_j q'(\vec{w})$. On the other hand, $\psi(c)$ generates
some Boolean combination of constraints all of which have the form $w_j > 0$. We begin by
considering the set $S^{\le \vec{0}}[KB]$ (rather than $S^{\vec{0}}[KB]$), so we can ignore the latter constraints
for now.

$S^{\le \vec{0}}[KB]$ is defined by a conjunction of linear constraints which (as discussed earlier)
implies that it is convex, and thus has a unique maximum-entropy point, say $\vec{v}$. Let $\mu^*$
be the distribution corresponding to $\vec{v}$. It is clear that the constraints of $\Gamma^{\le}(KB[\vec{0}])$
on the points of $\Delta^K$ are precisely the same ones as those of $\Lambda$. Therefore, $\mu^*$ is the unique
maximum-entropy distribution satisfying the constraints of $\Lambda$. By Remark 4.13, it follows
that $F_{[\xi_{\beta'}]}(\vec{v}) = \mu^*(\beta')$. Since we have assumed that $\mu^*(\beta') > 0$, we are almost in a
position to use Theorem 4.11. It remains to prove essential positivity.

Recall that the difference between $\Gamma^{\le}(KB[\vec{0}])$ and $\Gamma(KB[\vec{0}])$ is that the latter may have
some conjuncts of the form $w_j > 0$. Checking Definitions 3.4 and 3.6 we see that such terms
can appear only due to $\xi_{\beta'}(c)$ and, in fact, together they assert that $F_{[\xi_{\beta'}]}(\vec{w}) > 0$. But we
have assumed that $F_{[\xi_{\beta'}]}(\vec{v}) > 0$ and so $\vec{v}$ is a maximum-entropy point of $S^{\vec{0}}[KB]$ as well.
Thus, essential positivity holds and so, by Theorem 4.11,

$$\Pr\nolimits_\infty(\varphi(c) \mid \psi(c) \land KB'[\Lambda]) = F_{[\varphi|\psi]}(\vec{v}) = \Pr\nolimits_{\mu^*}(\beta \mid \beta')$$

as required. ∎
Theorem 4.15: Let $c$ be a constant symbol. Using the translation described in Section 4.3,
for a set $\mathcal{R}$ of defeasible rules, $B \to C$ is an ME-plausible consequence of $\mathcal{R}$ iff

$$\Pr\nolimits_\infty\Big(\xi_C(c) \,\Big|\, \xi_B(c) \land \bigwedge_{r \in \mathcal{R}} \theta_r\Big) = 1.$$



Proof: Let $KB'$ denote $\bigwedge_{r \in \mathcal{R}} \theta_r$. For sufficiently small $\vec{\tau}$ and for $\epsilon = \tau_1$, let $\mu^*$ denote
$\mu^*_{\epsilon, \mathcal{R}}$. It clearly suffices to prove that

$$\Pr\nolimits_\infty^{\vec{\tau}}(\xi_C(c) \mid \xi_B(c) \land KB') = \Pr\nolimits_{\mu^*}(C \mid B),$$

where by equality we also mean that one side is defined iff the other is also defined. It is
easy to verify that a point $\vec{u}$ in $\Delta^K$ satisfies $\Gamma(KB'[\vec{\tau}])$ iff the corresponding distribution
$\mu$ $\epsilon$-satisfies $\mathcal{R}$. Therefore, the maximum-entropy point $\vec{v}$ of $S^{\vec{\tau}}[KB']$ (which is unique,
by linearity) corresponds precisely to $\mu^*$. Now, there are two cases: either $\mu^*(B) > 0$ or
$\mu^*(B) = 0$. In the first case, by Remark 4.13, $\Pr_{\mu^*}(B) = F_{[\xi_B]}(\vec{v})$, so the latter is
also positive. This also implies that $\vec{v}$ is consistent with the constraints $\Gamma(\psi(c))$ entailed
by $\psi(c) = \xi_B(c)$, so that $\vec{v}$ is also the unique maximum-entropy point of $S^{\vec{\tau}}[KB]$ (where
$KB = \xi_B(c) \land KB'$). We can therefore use Corollary 4.10 and Remark 4.13 to conclude that
$\Pr_\infty^{\vec{\tau}}(\xi_C(c) \mid KB) = F_{[\xi_C|\xi_B]}(\vec{v}) = \Pr_{\mu^*}(C \mid B)$ and that all three terms are well-defined.
Assume, on the other hand, that $\mu^*(B) = 0$, so that $\Pr_{\mu^*}(C \mid B)$ is not well-defined. In this
case, we can use a known result (see (Paris & Vencovska, 1989)) for the maximum-entropy
point over a space defined by linear constraints, and conclude that for all $\mu$ satisfying $\mathcal{R}$,
necessarily $\mu(B) = 0$. Using the connection between distributions $\mu$ satisfying $\mathcal{R}$ and points
$\vec{u}$ in $S^{\vec{\tau}}[KB']$, we conclude that this is also the case for all $\vec{u} \in S^{\vec{\tau}}[KB']$. By part (a) of
Theorem B.2, this means that in any world satisfying $KB'$, the proportion $\|\xi_B(x)\|_x$ is
necessarily 0. Thus, $KB'$ is inconsistent with $\xi_B(c)$, and $\Pr_\infty^{\vec{\tau}}(\xi_C(c) \mid \xi_B(c) \land KB')$ is also not
well-defined. ∎



Appendix D. Proofs for Section 4.4 

Theorem 4.24: If KB and $\vec{\tau} > \vec{0}$ are stable for $\sigma^*$ then $\Pr_\infty^{\vec{\tau}}(\sigma^* \mid KB) = 1$.

Proof: By Corollary 3.14, it suffices to show that there is some open neighborhood
containing $Q$, the maximum-entropy points of $S^{\vec{\tau}}[KB]$, such that every world $W$ of KB in this
neighborhood has $\sigma(W) = \sigma^*$. So suppose this is not the case. Then there is some sequence
of worlds $W_1, W_2, \ldots$ such that $(W_i, \vec{\tau}) \models KB \land \neg\sigma^*$ and $\lim_{i \to \infty} \min_{\vec{v} \in Q} |\pi(W_i) - \vec{v}| = 0$.
Since $\Delta^K$ is compact the sequence $\pi(W_1), \pi(W_2), \ldots$ must have at least one accumulation
point, say $\vec{u}$. This point must be in the closure of the set $Q$. But, in fact, $Q$ is a closed
set (because entropy is a continuous function) and so $\vec{u} \in Q$. By part (a) of Theorem B.2,
$\pi(W_i) \in S^{\vec{\tau}}[KB \land \neg\sigma^*]$ for every $i$ and so, since this space is closed, $\vec{u} \in S^{\vec{\tau}}[KB \land \neg\sigma^*]$ as
well. But this means that $\vec{u}$ is an unsafe maximum-entropy point, contrary to the definition
and assumption of stability. ∎

In the remainder of this section we prove Theorem 4.28. For this purpose, fix $KB =
\psi \land KB'$, $\varphi$, and $\sigma^*$ to be as in the statement of this theorem, and let $\vec{v}$ be the unique
maximum-entropy point of $S^{\vec{0}}[KB]$.

Let $Z = \{c_1, \ldots, c_m\}$ be the set of constant symbols appearing in $\psi$ and in $\varphi$. Due to
the separability assumption, $KB'$ contains none of the constant symbols in $Z$. Let $\chi^{\ne}$ be
the formula $\bigwedge_{i \ne j} c_i \ne c_j$. We first prove that $\chi^{\ne}$ has probability 1 given $KB'$.

Lemma D.1: For $\chi^{\ne}$ and $KB'$ as above, $\Pr_\infty(\chi^{\ne} \mid KB') = 1$.

Proof: We actually show that $\Pr_\infty(\neg\chi^{\ne} \mid KB') = 0$. Let $c$ and $c'$ be two constant symbols
in $\{c_1, \ldots, c_m\}$ and consider $\Pr_\infty(c = c' \mid KB')$. We again use the direct inference technique.
Note that for any world of size $N$ the proportion expression $\|x = x'\|_{x,x'}$ denotes exactly
$1/N$. It is thus easy to see that $\Pr_\infty(\|x = x'\|_{x,x'} \approx_i 0 \mid KB') = 1$ (for any choice of $i$). Thus,
by Theorem 3.16, $\Pr_\infty(c = c' \mid KB') = \Pr_\infty(c = c' \mid KB' \land \|x = x'\|_{x,x'} \approx_i 0)$. But since $c$ and
$c'$ appear nowhere in $KB'$ we can use Theorem 4.1 to conclude that $\Pr_\infty(c = c' \mid KB') = 0$.
It is straightforward to verify that, since $\neg\chi^{\ne}$ is equivalent to a finite disjunction, each
disjunct of which implies $c = c'$ for at least one pair of constants $c$ and $c'$, we must have
$\Pr_\infty(\neg\chi^{\ne} \mid KB') = 0$. ∎
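
The claim that the proportion of pairs satisfying $x = x'$ is exactly $1/N$ can be checked by brute-force enumeration. The sketch below assumes the domain $\{1, \ldots, N\}$ used throughout; the function name `equality_proportion` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def equality_proportion(N):
    """Fraction of ordered pairs (a, b) over the domain {1, ..., N} with a == b."""
    domain = range(1, N + 1)
    equal_pairs = sum(1 for a, b in product(domain, repeat=2) if a == b)
    return Fraction(equal_pairs, N * N)


proportions = [equality_proportion(N) for N in (1, 2, 7, 50)]
```

Since exactly $N$ of the $N^2$ ordered pairs lie on the diagonal, the proportion is $1/N$, which is approximately 0 for any fixed tolerance once $N$ is large enough.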

As we stated in Section 4.4, our general technique for computing the probability of an
arbitrary formula $\varphi$ is to partition the worlds into a finite collection of classes such that $\varphi$
behaves uniformly over each class and then to compute the relative weights of the classes.
As we show later, the classes are essentially defined using complete descriptions. Their
relative weight corresponds to the probabilities of the different complete descriptions given
KB.

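Proposition D.2: Let $KB = KB' \land \psi$ and $\vec{v}$ be as above. Assume that $\Pr_\infty(\psi \mid KB') > 0$.
Let $D$ be a complete description over $Z$ that is consistent with $\psi$.

(a) If $D$ is inconsistent with $\chi^{\ne}$, then $\Pr_\infty(D \mid KB) = 0$.

(b) If $D$ is consistent with $\chi^{\ne}$, then

$$\Pr\nolimits_\infty(D \mid KB) = \frac{F_{[D]}(\vec{v})}{\sum_{D' \in \mathcal{A}(\psi \land \chi^{\ne})} F_{[D']}(\vec{v})}.$$

Proof: First, observe that if all limits exist and the denominator is nonzero, then

$$\Pr\nolimits_\infty(\neg\chi^{\ne} \mid \psi \land KB') = \frac{\Pr_\infty(\neg\chi^{\ne} \land \psi \mid KB')}{\Pr_\infty(\psi \mid KB')}.$$

By hypothesis, the denominator is indeed nonzero. Furthermore, by Lemma D.1, $\Pr_\infty(\neg\chi^{\ne} \land
\psi \mid KB') \le \Pr_\infty(\neg\chi^{\ne} \mid KB') = 0$. Hence $\Pr_\infty(\chi^{\ne} \mid KB) = \Pr_\infty(\chi^{\ne} \mid KB' \land \psi) = 1$. We can
therefore use Theorem 3.16 to conclude that

$$\Pr\nolimits_\infty(D \mid KB) = \Pr\nolimits_\infty(D \mid KB \land \chi^{\ne}).$$

Part (a) of the proposition follows immediately.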

To prove part (b), recall that $\psi$ is equivalent to the disjunction $\bigvee_{E \in \mathcal{A}(\psi)} E$. By simple
probabilistic reasoning, the assumption that $\Pr_\infty(\psi \mid KB') > 0$, and part (a), we conclude
that

$$\Pr\nolimits_\infty(D \mid \psi \land KB') = \frac{\Pr_\infty(D \land \psi \mid KB')}{\Pr_\infty(\psi \mid KB')} = \frac{\Pr_\infty(D \land \psi \mid KB')}{\sum_{E \in \mathcal{A}(\psi \land \chi^{\ne})} \Pr_\infty(E \mid KB')}.$$

By assumption, $D$ is consistent with $\chi^{\ne}$ and is in $\mathcal{A}(\psi)$. Since $D$ is a complete description,
we must have that $D \Rightarrow \psi$ is valid. Thus, the numerator on the right-hand side of this
equation is simply $\Pr_\infty(D \mid KB')$. Hence, the problem of computing $\Pr_\infty(D \mid KB)$ reduces to
a series of computations of the form $\Pr_\infty(E \mid KB')$ for various complete descriptions $E$.

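Fix any such description $E$. Recall that $E$ can be decomposed into three parts: the
unary part $E^1$, the non-unary part $E^{>1}$, and the equality part $E^{=}$. Since $E$ is in $\mathcal{A}(\chi^{\ne})$,
we conclude that $\chi^{\ne}$ is equivalent to $E^{=}$. Using Theorem 3.16 twice and some probabilistic
reasoning, we get:

$$\Pr\nolimits_\infty(E^{>1} \land E^1 \land E^{=} \mid KB') = \Pr\nolimits_\infty(E^{>1} \land E^1 \land E^{=} \mid KB' \land \chi^{\ne})$$
$$= \Pr\nolimits_\infty(E^{>1} \land E^1 \mid KB' \land \chi^{\ne})$$
$$= \Pr\nolimits_\infty(E^{>1} \mid KB' \land \chi^{\ne} \land E^1) \cdot \Pr\nolimits_\infty(E^1 \mid KB' \land \chi^{\ne})$$
$$= \Pr\nolimits_\infty(E^{>1} \mid KB' \land \chi^{\ne} \land E^1) \cdot \Pr\nolimits_\infty(E^1 \mid KB').$$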

In order to simplify the first expression, recall that none of the predicate symbols in $E^{>1}$
occur anywhere in $KB' \land \chi^{\ne} \land E^1$. Therefore, the probability of $E^{>1}$ given $KB' \land \chi^{\ne}$ is
equal to the probability that the elements denoting the $|Z|$ (different) constants satisfy some
particular configuration of non-unary properties. It should be clear that, by symmetry, all
such configurations are equally likely. Therefore, the probability of any one of them is a
constant, equal to 1 over the total number of configurations.$^{14}$ Let $\rho$ denote the constant
which is equal to $\Pr_\infty(E^{>1} \mid KB' \land \chi^{\ne} \land E^1)$ for all $E$.

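The last step is to show that, if $E^1$ is equivalent to $\bigwedge_{j=1}^m A_{i_j}(c_j)$, then $\Pr_\infty(E^1 \mid KB') =
F_{[D]}(\vec{v})$:

$$\Pr\nolimits_\infty\Big(\bigwedge_{j=1}^m A_{i_j}(c_j) \,\Big|\, KB'\Big) = \Pr\nolimits_\infty\Big(A_{i_1}(c_1) \,\Big|\, \bigwedge_{j=2}^m A_{i_j}(c_j) \land KB'\Big) \cdot \Pr\nolimits_\infty\Big(A_{i_2}(c_2) \,\Big|\, \bigwedge_{j=3}^m A_{i_j}(c_j) \land KB'\Big)$$
$$\cdots \Pr\nolimits_\infty(A_{i_{m-1}}(c_{m-1}) \mid A_{i_m}(c_m) \land KB') \cdot \Pr\nolimits_\infty(A_{i_m}(c_m) \mid KB')$$
$$= v_{i_1} \cdot \ldots \cdot v_{i_m} \quad \text{(using Theorem 4.11; see below)}$$
$$= F_{[D]}(\vec{v}).$$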
The first step is simply probabilistic reasoning. The second step uses $m$ applications of
Theorem 4.11. It is easy to see that $A_{i_j}(c_j)$ is a simple query for $A_{i_{j+1}}(c_{j+1}) \land \ldots \land
A_{i_m}(c_m) \land KB'$. We would like to show that

$$\Pr\nolimits_\infty\Big(A_{i_j}(c_j) \,\Big|\, \bigwedge_{\ell=j+1}^m A_{i_\ell}(c_\ell) \land KB'\Big) = \Pr\nolimits_\infty(A_{i_j}(c_j) \mid KB') = v_{i_j},$$

where Theorem 4.11 justifies the last equality. To prove the first equality, we show that for
all $j$, the spaces $S^{\vec{0}}[KB']$ and $S^{\vec{0}}[\bigwedge_{\ell=j+1}^m A_{i_\ell}(c_\ell) \land KB']$ have the same maximum-entropy
point, namely $\vec{v}$. This is proved by backwards induction; the $j = m$ case is trivially true.
The difference between the $(j-1)$st and $j$th case is the added conjunct $A_{i_j}(c_j)$, which
amounts to adding the new constraint $w_{i_j} > 0$. There are two possibilities. First, if $v_{i_j} > 0$,



14. Although we do not need the value of this constant in our calculations below, it is in fact easy to verify
that its value is $\prod_{R} 2^{-m^{\mathrm{arity}(R)}}$, where the product ranges over the non-unary predicate symbols $R$ and $m = |Z|$.
then $\vec{v}$ satisfies this new constraint anyway and so remains the maximum-entropy point,
completing this step of the induction. If $v_{i_j} = 0$ this is not the case, and indeed, the
property we are trying to prove can be false (for $j < m$). But this does not matter, because
we then know that $\Pr_\infty(A_{i_j}(c_j) \mid \bigwedge_{\ell=j+1}^m A_{i_\ell}(c_\ell) \land KB') = \Pr_\infty(A_{i_j}(c_j) \mid KB') = v_{i_j} = 0$. Since
both of the products in question include a 0 factor, it is irrelevant as to whether the other
terms agree.

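We can now put everything together to conclude that

$$\Pr\nolimits_\infty(D \mid KB) = \frac{\Pr_\infty(D \mid KB')}{\sum_{E \in \mathcal{A}(\psi \land \chi^{\ne})} \Pr_\infty(E \mid KB')} = \frac{F_{[D]}(\vec{v})}{\sum_{E \in \mathcal{A}(\psi \land \chi^{\ne})} F_{[E]}(\vec{v})},$$

proving part (b). ∎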

We now address the issue of computing $\Pr_\infty(\varphi \mid KB)$ for an arbitrary formula $\varphi$. To do
that, we must first investigate the behavior of $\Pr_\infty^{\vec{\tau}}(\varphi \mid KB)$ for small $\vec{\tau}$. Fix some sufficiently
small $\vec{\tau} > \vec{0}$, and let $Q$ be the set of maximum-entropy points of $S^{\vec{\tau}}[KB]$. Assume KB and
$\vec{\tau}$ are stable for $\sigma^*$. By definition, this means that for every $\vec{v} \in Q$, we have $\sigma(\vec{v}) = \sigma^*$. Let
$I$ be the set of $i$'s for which $\sigma^*$ contains the conjunct $\exists x A_i(x)$. Since $\sigma(\vec{v}) = \sigma^*$ for all $\vec{v}$,
we must have that $v_i > 0$ for all $i \in I$. Since $Q$ is a closed set, this implies that there exists
some $\epsilon > 0$ such that for all $\vec{v} \in Q$ and for all $i \in I$, we have $v_i > \epsilon$. Let $\theta[\epsilon]$ be the formula

$$\bigwedge_{i \in I} \|A_i(x)\|_x > \epsilon.$$

The following proposition is now easy to prove: 

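Proposition D.3: Suppose that KB and $\vec{\tau}$ are stable for $\sigma^*$ and that $Q$, $I$, $\theta[\epsilon]$, and $\chi^{\ne}$
are as above. Then

$$\Pr\nolimits_\infty^{\vec{\tau}}(\varphi \mid KB) = \sum_{D \in \mathcal{A}(\psi)} \Pr\nolimits_\infty^{\vec{\tau}}(\varphi \mid KB' \land \theta[\epsilon] \land \sigma^* \land D) \cdot \Pr\nolimits_\infty^{\vec{\tau}}(D \mid KB).$$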

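Proof: Clearly, $\theta[\epsilon]$ satisfies the conditions of Corollary 3.14, allowing us to conclude that
$\Pr_\infty^{\vec{\tau}}(\theta[\epsilon] \mid KB) = 1$. Similarly, by Theorem 4.24 and the assumptions of Theorem 4.28,
we can conclude that $\Pr_\infty^{\vec{\tau}}(\sigma^* \mid KB) = 1$. Since the conjunction of two assertions that have
probability 1 also has probability 1, we can use Theorem 3.16 to conclude that $\Pr_\infty^{\vec{\tau}}(\varphi \mid KB) =
\Pr_\infty^{\vec{\tau}}(\varphi \mid KB \land \theta[\epsilon] \land \sigma^*)$.

Now, recall that $\psi$ is equivalent to the disjunction $\bigvee_{D \in \mathcal{A}(\psi)} D$. By straightforward
probabilistic reasoning, we can therefore conclude that

$$\Pr\nolimits_\infty^{\vec{\tau}}(\varphi \mid KB \land \theta[\epsilon] \land \sigma^*) = \sum_{D \in \mathcal{A}(\psi)} \Pr\nolimits_\infty^{\vec{\tau}}(\varphi \mid KB \land \theta[\epsilon] \land \sigma^* \land D) \cdot \Pr\nolimits_\infty^{\vec{\tau}}(D \mid KB \land \theta[\epsilon] \land \sigma^*).$$

By Theorem 3.16 again, $\Pr_\infty^{\vec{\tau}}(D \mid KB \land \theta[\epsilon] \land \sigma^*) = \Pr_\infty^{\vec{\tau}}(D \mid KB)$. The desired expression now
follows. ∎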

We now simplify the expression $\Pr_\infty^{\vec{\tau}}(\varphi \mid KB \land \theta[\epsilon] \land \sigma^* \land D)$.

Proposition D.4: For $\varphi$, KB, $\sigma^*$, $D$, and $\theta[\epsilon]$ as above, if $\Pr_\infty^{\vec{\tau}}(D \mid KB) > 0$, then

$$\Pr\nolimits_\infty^{\vec{\tau}}(\varphi \mid KB \land \theta[\epsilon] \land \sigma^* \land D) = \Pr\nolimits_\infty(\varphi \mid \sigma^* \land D),$$

and its value is either 0 or 1. Note that since the latter probability only refers to first-order
formulas, it is independent of the tolerance values.

Proof: That the right-hand side is either 0 or 1 is proved in (Grove et al., 1993b), where it
is shown that the asymptotic probability of any pure first-order sentence when conditioned
on knowledge of the form $\sigma^* \land D$ (which is, essentially, what was called a model description
in (Grove et al., 1993b)) is either 0 or 1. Very similar techniques can be used to show that
the left-hand side is also either 0 or 1, and that the conjuncts $KB \land \theta[\epsilon]$ do not affect this
limit (so that the left-hand side and the right-hand side are in fact equal). We briefly sketch
the relevant details here, referring the reader to (Grove et al., 1993b) for full details.

The idea (which actually goes back to Fagin (1976)) is to associate with a model description
such as $\sigma^* \wedge D$ a theory $T$ which essentially consists of extension axioms. Intuitively,
an extension axiom says that any finite substructure of the model defined by a complete
description $D'$ can be extended in all possible ways definable by another description $D''$. We
say that a description $D''$ extends a description $D'$ if all conjuncts of $D'$ are also conjuncts in
$D''$. An extension axiom has the form $\forall x_1, \ldots, x_j\, (D' \Rightarrow \exists x_{j+1}\, D'')$, where $D'$ is a
complete description over $X = \{x_1, \ldots, x_j\}$ and $D''$ is a complete description over
$X \cup \{x_{j+1}\}$, such that $D''$ extends $D'$, both $D'$ and $D''$ extend $D$, and both are consistent
with $\sigma^*$. It is then shown that (a) $T$ is complete (so that for each formula $\xi$, either
$T \models \xi$ or $T \models \neg\xi$) and (b) if $\xi \in T$ then $\Pr_{\infty}(\xi \mid \sigma^* \wedge D) = 1$. From (b) it
easily follows that if $T \models \xi$, then $\Pr_{\infty}(\xi \mid \sigma^* \wedge D)$ is also 1. Using (a), the desired
0-1 law follows. The only difference from the proof in (Grove et al., 1993b) is that we need to
show that (b) holds even when we condition on $KB \wedge \theta[\epsilon] \wedge \sigma^* \wedge D$, instead of just on
$\sigma^* \wedge D$.

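So suppose $\xi$ is the extension axiom $\forall x_1, \ldots, x_j\, (D' \Rightarrow \exists x_{j+1}\, D'')$. We must show
that $\Pr^{\vec{\tau}}_{\infty}(\xi \mid KB \wedge \theta[\epsilon] \wedge \sigma^* \wedge D) = 1$. We first want to show that the
right-hand side of the conditional is consistent. As observed in the previous proof, it follows
from Theorem 3.16 that
$\Pr^{\vec{\tau}}_{\infty}(D \mid KB) = \Pr^{\vec{\tau}}_{\infty}(D \mid KB \wedge \theta[\epsilon] \wedge \sigma^*)$. Since we are assuming
that $\Pr^{\vec{\tau}}_{\infty}(D \mid KB) > 0$, it follows that
$\Pr^{\vec{\tau}}_{\infty}(KB \wedge \theta[\epsilon] \wedge \sigma^* \wedge D) > 0$, and hence
$KB \wedge \theta[\epsilon] \wedge \sigma^* \wedge D$ must be consistent.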

Fix a domain size $N$ and consider the set of worlds satisfying
$KB \wedge \theta[\epsilon] \wedge \sigma^* \wedge D$. Now consider some particular $j$ domain elements, say
$d_1, \ldots, d_j$, that satisfy $D'$. Observe that, since $D'$ extends $D$, the denotations of the
constants are all among $d_1, \ldots, d_j$. For a given $d \notin \{d_1, \ldots, d_j\}$, let $B(d)$ denote the
event that $d_1, \ldots, d_j, d$ satisfy $D''$, given that $d_1, \ldots, d_j$ satisfy $D'$. What is the
probability of $B(d)$ given $KB \wedge \theta[\epsilon] \wedge \sigma^* \wedge D$? First, note that since $d$ does not
denote any constant, it cannot be mentioned in any way in the knowledge base. Thus, this
probability is the same for all $d$. The description $D''$ determines two types of properties for
$x_{j+1}$: the unary properties of $x_{j+1}$ itself (i.e., the atom $A_i$ to which $x_{j+1}$ must belong)
and the relations between $x_{j+1}$ and the remaining variables $x_1, \ldots, x_j$ using the non-unary
predicate symbols. Since $D''$ is consistent with $\sigma^*$, the description $\sigma^*$ must contain a
conjunct $\exists x\, A_i(x)$ if $D''$ implies $A_i(x_{j+1})$. By definition, $\theta[\epsilon]$ must therefore
contain the conjunct $\|A_i(x)\|_x > \epsilon$. Hence, the probability of picking $d$ in $A_i$ is at least
$\epsilon$. For any sufficiently large $N$, the probability of picking $d$ in $A_i$ which is different from
$d_1, \ldots, d_j$ (as required by the definition of the extension axiom) is at least $\epsilon/2 > 0$. The
probability that $d_1, \ldots, d_j, d$ also satisfy the remaining conjuncts of $D''$, given that $d$ is
in atom $A_i$ and $d_1, \ldots, d_j$ satisfy $D'$, is very small but bounded away from 0. (For this
to hold, we need the assumption that the non-unary predicates are not mentioned in the
KB.) This is the case because the total number of possible ways to choose the properties
of $d$ (as they relate to $d_1, \ldots, d_j$) is independent of $N$. We can therefore conclude that the
probability of $B(d)$ (for sufficiently large $N$), given that $d_1, \ldots, d_j$ satisfy $D'$, is bounded
away from 0 by some $\lambda$ independent of $N$. Since the properties of an element $d$ and its
relation to $d_1, \ldots, d_j$ can be chosen independently of the properties of a different element
$d'$, the different events $B(d), B(d'), \ldots$ are all independent. Therefore, the probability that
there is no domain element at all that, together with $d_1, \ldots, d_j$, satisfies $D''$ is at most
$(1 - \lambda)^{N-j}$. This bounds the probability of the extension axiom being false, relative to
fixed $d_1, \ldots, d_j$. There are $\binom{N}{j}$ ways of choosing these $j$ elements, so the probability of
the axiom being false anywhere in a model is at most $\binom{N}{j}(1 - \lambda)^{N-j}$. This tends to 0 as
$N$ goes to infinity. Therefore, the extension axiom
$\forall x_1, \ldots, x_j\, (D' \Rightarrow \exists x_{j+1}\, D'')$ has asymptotic probability 1 given
$KB \wedge \theta[\epsilon] \wedge \sigma^* \wedge D$, as desired. ∎
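
To see how quickly the final bound vanishes, the following Python sketch evaluates the binomial
coefficient times the geometric factor in log space. The function name `extension_failure_bound`
and the sample values of `j` and `lam` (the constant that bounds the probability of B(d) away
from 0) are illustrative assumptions, not values taken from the proof. The result is capped at 1,
since it bounds a probability.

```python
import math

def extension_failure_bound(n: int, j: int, lam: float) -> float:
    """Return min(1, C(n, j) * (1 - lam) ** (n - j))."""
    if not 0 <= j <= n:
        raise ValueError("need 0 <= j <= n")
    if not 0.0 < lam < 1.0:
        raise ValueError("need 0 < lam < 1")
    # Work with logarithms so that large n does not overflow.
    log_bound = math.log(math.comb(n, j)) + (n - j) * math.log1p(-lam)
    # A bound above 1 is vacuous for a probability, so cap the exponent at 0.
    return math.exp(min(log_bound, 0.0))

# Hypothetical parameters: j = 2 fixed elements and lam = 0.05.
bounds = {n: extension_failure_bound(n, j=2, lam=0.05) for n in (10, 100, 1000, 10000)}
for n, b in bounds.items():
    print(f"N = {n:>5}: bound = {b:.3e}")
```

For small N the bound is vacuous, but the polynomial growth of the binomial factor is eventually
dominated by the exponential decay of the geometric factor, which is the reason the limit is 0.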

Finally, we are in a position to prove Theorem 4.28. 

Theorem 4.28: Let $\varphi$ be a formula in $\mathcal{L}^{\approx}$ and let $KB = KB' \wedge \psi$ be an essentially
positive knowledge base in $\mathcal{L}^{\approx}_1$ which is separable with respect to $\varphi$. Let $Z$ be the set of
constants appearing in $\varphi$ or in $\psi$ (so that $KB'$ contains none of the constants in $Z$) and
let $\chi^{\neq}$ be the formula $\bigwedge_{c, c' \in Z} c \neq c'$. Assume that there exists a size description
$\sigma^*$ such that, for all $\vec{\tau} > 0$, $KB$ and $\vec{\tau}$ are stable for $\sigma^*$, and that the space
$S^{\vec{0}}[KB]$ has a unique maximum-entropy point $\vec{v}$. Then

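$$\Pr\nolimits_{\infty}(\varphi \mid KB) = \frac{\sum_{D \in \mathcal{A}(\psi \wedge \chi^{\neq})} \Pr\nolimits_{\infty}(\varphi \mid \sigma^* \wedge D) \cdot F_{[D]}(\vec{v})}{\sum_{D \in \mathcal{A}(\psi \wedge \chi^{\neq})} F_{[D]}(\vec{v})}$$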

if the denominator is positive. 

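Proof: Assume without loss of generality that $\psi$ mentions all the constant symbols in $\varphi$,
so that $\mathcal{A}(\psi \wedge \chi^{\neq}) \subseteq \mathcal{A}(\psi)$. By Proposition D.3,

$$\Pr\nolimits^{\vec{\tau}}_{\infty}(\varphi \mid KB) = \sum_{D \in \mathcal{A}(\psi)} \Pr\nolimits^{\vec{\tau}}_{\infty}(\varphi \mid KB \wedge \theta[\epsilon] \wedge \sigma^* \wedge D) \cdot \Pr\nolimits^{\vec{\tau}}_{\infty}(D \mid KB).$$

Note that we cannot easily take limits of
$\Pr^{\vec{\tau}}_{\infty}(\varphi \mid KB \wedge \theta[\epsilon] \wedge \sigma^* \wedge D)$ as $\vec{\tau}$ goes to $\vec{0}$, because
this expression depends on $\theta[\epsilon]$ and the value of $\epsilon$ used depends on the choice of $\vec{\tau}$.
However, applying Proposition D.4, we get

$$\Pr\nolimits^{\vec{\tau}}_{\infty}(\varphi \mid KB) = \sum_{D \in \mathcal{A}(\psi)} \Pr\nolimits_{\infty}(\varphi \mid \sigma^* \wedge D) \cdot \Pr\nolimits^{\vec{\tau}}_{\infty}(D \mid KB).$$

We can now take the limit as $\vec{\tau}$ goes to $\vec{0}$. To do this, we use Proposition D.2. The
hypotheses of the theorem imply that $\Pr_{\infty}(\psi \mid KB') > 0$ (for otherwise, the denominator
$\sum_{D \in \mathcal{A}(\psi \wedge \chi^{\neq})} F_{[D]}(\vec{v})$ would be zero). Part (a) of the proposition tells us we can
ignore those complete descriptions that are inconsistent with $\chi^{\neq}$. We can now apply part (b)
to get the desired result. ∎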